I am dealing with a Spark dataframe df which has two columns, tstamp and c_1. The data type of c_1 is string, and I want to add a new column by extracting the string between two markers in that field.
For example: original dataframe df
tstamp               c_1
2022-06-15 10:00:00  xxx&cd7=H10S10P10&cd21=GA&cd3=6...
2022-06-15 10:10:01  xz&cd7=H11S11P11&cd21=CA&cd3=5...
We want to add a new column (to the same or another dataframe) called cd_7 whose value is the string between 'cd7=' and '&cd21', like below:
tstamp               c_1                                  cd_7
2022-06-15 10:00:00  xxx&cd7=H10S10P10&cd21=GA&cd3=6...   H10S10P10
2022-06-15 10:10:01  xz&cd7=H11S11P11&cd21=CA&cd3=5...    H11S11P11
How could I write it using Pyspark? Thanks!
Use a regex to extract everything between the special characters = and & (non-greedy, so the match stops at the first & after the first =):
from pyspark.sql.functions import regexp_extract

df.withColumn('x', regexp_extract('c_1', r'(?<=[\=]).*?(?=[\&])', 0)).show()
+-------------------+--------------------+---------+
| tstamp| c_1| x|
+-------------------+--------------------+---------+
|2022-06-15 10:00:00|xxx&cd7=H10S10P10...|H10S10P10|
|2022-06-15 10:10:01|xz&cd7=H11S11P11&...|H11S11P11|
+-------------------+--------------------+---------+
I used an alternative way to get the answer by converting the data to a pandas dataframe and doing the manipulation there, but this is not ideal if the data is large.
df['cd_7'] = df['c_1'].apply(lambda st: st[st.find("cd7=")+4:st.find("&cd21")])
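If you want to stay in Spark and avoid the pandas conversion, a minimal sketch of the same extraction with an explicit capture group (column names taken from the question) would be:
from pyspark.sql import functions as F

# group 1 captures everything between 'cd7=' and the following '&cd21'
df = df.withColumn('cd_7', F.regexp_extract('c_1', r"cd7=(.*?)&cd21", 1))
df.select('tstamp', 'c_1', 'cd_7').show(truncate=False)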
I have a dataframe on which I can run split for a specific column and get a Series, but how do I then add my other columns back into this dataframe? Or do I somehow specify in the split that column a is the groupBy key and the split is applied to column b?
input:
ixd _id systemA systemB
0 abc123 1703.0|1144.0 2172.0|735.0
output:
pandas Series data (not expanded) for systemA and systemB, split on '|' and grouped by _id
It sounds like a regular .groupby will achieve what you are after:
for specific_value, subset_df in df.groupby(column_of_interest):
...
The subset_df will be a pandas dataframe containing only rows for which column_of_interest contains specific_value.
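For instance, a minimal sketch (column names taken from the sample input, data made up to match it) that splits systemA and systemB on '|' within each _id group:
import pandas as pd

df = pd.DataFrame({"_id": ["abc123"],
                   "systemA": ["1703.0|1144.0"],
                   "systemB": ["2172.0|735.0"]})

for _id, subset_df in df.groupby("_id"):
    # str.split returns a Series of lists, one entry per row in the group
    split_a = subset_df["systemA"].str.split("|")
    split_b = subset_df["systemB"].str.split("|")
    print(_id, split_a.tolist(), split_b.tolist())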
I have an array in the format [27.214 27.566] - there can be several numbers. Additionally I have a Datetime variable.
from datetime import datetime
import time
import numpy as np

now = datetime.now()
datetime = now.strftime('%Y-%m-%d %H:%M:%S')  # formatted timestamp string (note: this rebinds the name datetime)
time.sleep(0.5)
agilent.write("MEAS:TEMP? (#101:102)")  # agilent is the instrument connection (assumed to exist)
values = np.fromstring(agilent.read(), dtype=float, sep=',')
The output from the array is [27.214 27.566]
Now I would like to write this into a dataframe with the following structure:
Datetime, FirstValueArray, SecondValueArray, ....
How can I do this? A new array is added to the dataframe every minute.
I will assume you want to append a row to an existing dataframe df with appropriate columns: value1, value2, ..., lastvalue, datetime
We can easily convert the array to a series:
s = pd.Series(array)
What you want to do next is append the datetime value to the series:
s = s.append(pd.Series([datetime]), ignore_index=True) cf Series.append (deprecated in recent pandas in favour of pd.concat)
Now you have a series whose length matches df.columns. You want to convert that series to a dataframe to be able to use pd.concat:
df_to_append = s.to_frame().T
We take the transpose because Series.to_frame() returns a dataframe with the series as a single column, and we want a single row with multiple columns.
Before you concatenate, however, you need to make sure both dataframes' column names match, or the concat will create additional columns:
df_to_append.columns = df.columns
Now we can concatenate our two dataframes:
df = pd.concat([df, df_to_append], ignore_index=True) cf pandas.concat
For further details, see the documentation
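Putting the steps together, here is a minimal end-to-end sketch; the column names value1, value2, datetime are assumptions, adapt them to your own frame:
import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame(columns=["value1", "value2", "datetime"])   # assumed target frame

values = np.array([27.214, 27.566])                    # measurement array
stamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')   # formatted timestamp

row = pd.Series(list(values) + [stamp])                # one row: values followed by the timestamp
df_to_append = row.to_frame().T
df_to_append.columns = df.columns                      # align column names
df = pd.concat([df, df_to_append], ignore_index=True)
print(df)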
I have an Excel file with multilevel columns and I need to melt them into single-level columns.
import pandas as pd

df = pd.read_excel('test.xlsx')
df.to_excel('test1.xlsx')
I need the dataframe output to look like below
Geo PC Month A B C Total
Jan-19
Feb-19
Consider using pandas.melt?
From the docs
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
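A hedged sketch of what that could look like; the column names Geo, PC and the month columns are assumptions based on the desired output:
import pandas as pd

df = pd.DataFrame({"Geo": ["APAC", "EMEA"],
                   "PC": ["P1", "P2"],
                   "Jan-19": [10, 20],
                   "Feb-19": [30, 40]})

# unpivot the month columns into a single Month/Value pair, keeping Geo and PC as identifiers
melted = pd.melt(df, id_vars=["Geo", "PC"], var_name="Month", value_name="Value")
print(melted)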
I am currently trying to figure out how to pass the string format argument to the to_date PySpark function via a column parameter.
Specifically, I have the following setup:
from pyspark import SparkContext
from pyspark.sql import functions as F

sc = SparkContext.getOrCreate()
df = sc.parallelize([('a','2018-01-01','yyyy-MM-dd'),
                     ('b','2018-02-02','yyyy-MM-dd'),
                     ('c','02-02-2018','dd-MM-yyyy')]).toDF(
    ["col_name","value","format"])
I am currently trying to add a new column, where each of the dates from the column F.col("value"), which is a string value, is parsed to a date.
Separately for each format, this can be done with
df = df.withColumn("test1",F.to_date(F.col("value"),"yyyy-MM-dd")).\
withColumn("test2",F.to_date(F.col("value"),"dd-MM-yyyy"))
This, however, gives me two new columns, but I want one column containing the result for each row according to its own format. Passing the format as a column does not seem to be possible with the to_date function:
df = df.withColumn("test3",F.to_date(F.col("value"),F.col("format")))
Here an error "Column object not callable" is thrown.
Is it possible to have a generic approach for all possible formats (so that I do not have to manually add new columns for each format)?
You can use a column value as a parameter without a udf using the spark-sql syntax:
Spark version 2.2 and above
from pyspark.sql.functions import expr
df.withColumn("test3",expr("to_date(value, format)")).show()
#+--------+----------+----------+----------+
#|col_name| value| format| test3|
#+--------+----------+----------+----------+
#| a|2018-01-01|yyyy-MM-dd|2018-01-01|
#| b|2018-02-02|yyyy-MM-dd|2018-02-02|
#| c|02-02-2018|dd-MM-yyyy|2018-02-02|
#+--------+----------+----------+----------+
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql("select *, to_date(value, format) as test3 from df").show()
Spark version 1.5 and above
Older versions of Spark do not support a format argument to the to_date function, so you'll have to use unix_timestamp and from_unixtime:
from pyspark.sql.functions import expr
df.withColumn(
    "test3",
    expr("from_unixtime(unix_timestamp(value,format))").cast("date")
).show()
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql(
"select *, cast(from_unixtime(unix_timestamp(value,format)) as date) as test3 from df"
).show()
As far as I know, your problem requires a udf (user-defined function) to apply the correct format per row. Inside a udf, however, you cannot directly use Spark functions like to_date, so I created a little workaround in the solution. First, the udf does the Python date conversion with the appropriate format from the column and converts the value to ISO format. Then another withColumn converts the ISO date to the correct format in column test3. Note that you have to adapt the format strings in the original column to match the Python date format codes, e.g. yyyy -> %Y, MM -> %m, ...
import datetime
from pyspark.sql.functions import udf, col, to_date

test_df = spark.createDataFrame([
    ('a','2018-01-01','%Y-%m-%d'),
    ('b','2018-02-02','%Y-%m-%d'),
    ('c','02-02-2018','%d-%m-%Y')
], ("col_name","value","format"))

def map_to_date(s, format):
    return datetime.datetime.strptime(s, format).isoformat()

myudf = udf(map_to_date)

test_df.withColumn("test3", myudf(col("value"), col("format")))\
    .withColumn("test3", to_date("test3")).show(truncate=False)
Result:
+--------+----------+--------+----------+
|col_name|value |format |test3 |
+--------+----------+--------+----------+
|a |2018-01-01|%Y-%m-%d|2018-01-01|
|b |2018-02-02|%Y-%m-%d|2018-02-02|
|c |02-02-2018|%d-%m-%Y|2018-02-02|
+--------+----------+--------+----------+
You don't need the format column at all. You can use coalesce to check for all possible options:
def get_right_date_format(date_string):
    from pyspark.sql import functions as F
    return F.coalesce(
        F.to_date(date_string, 'yyyy-MM-dd'),
        F.to_date(date_string, 'dd-MM-yyyy'),
        F.to_date(date_string, 'yyyy-dd-MM')
    )

df = sc.parallelize([('a','2018-01-01'),
                     ('b','2018-02-02'),
                     ('c','2018-21-02'),
                     ('d','02-02-2018')]).toDF(
    ["col_name","value"])

df = df.withColumn("formatted_data", get_right_date_format(df.value))
The issue with this approach, though, is that a date like 2020-02-01 would be treated as 1 Feb 2020, when 2 Jan 2020 is also possible.
Just an alternative approach!
I am trying to convert a DataFrame to an RDD in order to explode the map (with key-value pairs) into different rows.
Info = sqlContext.read.format("csv"). \
    option("delimiter", "\t"). \
    option("header", "True"). \
    option("inferSchema", "True"). \
    load("file.tsv")
DataFrame[ID: int, Date: timestamp, Comments: string]
The sample data in the DF is as follows.
ID Date Comments
1 2015-04-30 22:42:49.0 {44:'xxxxxxxx'}
2 2015-05-06 08:53:18.0 {83:'aaaaaaaaa', 175:'bbbbbbbbb', 86:'cccccccccc'}
3 2015-05-13 19:57:13.0 {487:'yyyyyyyyyyy', 48:'zzzzzzzzzzzzzz'}
Now, the comments are already key-value pairs, but they are read as a string; I want to explode each key-value pair into a separate row. For example:
Expected OUTPUT
ID Date Comments
1 2015-04-30 22:42:49.0 {44:'xxxxxxxx'}
2 2015-05-06 08:53:18.0 {83:'aaaaaaaaa'}
2 2015-05-06 08:53:18.0 {175:'bbbbbbbbb'}
2 2015-05-06 08:53:18.0 {86:'cccccccccc'}
3 2015-05-13 19:57:13.0 {487:'yyyyyyyyyyy'}
3 2015-05-13 19:57:13.0 {48:'zzzzzzzzzzzzzz'}
I have tried converting it to an RDD and applying flatMap, but without success. I want all columns to be returned. I have tried this:
Info.rdd.flatMap(lambda x: (x['SearchParams'].split(':'), x))
Use the provided split and explode functions in the DataFrame API to split the data on ",". To create the map, you want to use create_map. This function expects two separate columns as input. Below is an example where two temporary columns are created (again using split):
Info.withColumn("Comments", explode(split(col("Comments"), ", ")))
.withColumn("key", split(col("Comments"), ":").getItem(0))
.withColumn("value", split(col("Comments"), ":").getItem(1))
.withColumn("Comments", create_map(col("key"), col("value")))
It should be possible to make this shorter, like this (not tested; note that Spark may reject explode when it is nested inside another expression):
(Info
    .withColumn("Comments", split(explode(split(col("Comments"), ", ")), ":"))
    .withColumn("Comments", create_map(col("Comments").getItem(0), col("Comments").getItem(1))))