Concat 1 to n items into new spark column - apache-spark

I am trying to concatenate fields dynamically, based on some configuration settings. The goal is a new field holding the merged values of 1 to n fields.
language = "JP;EN"
language = list(str(item) for item in language.split(";"))
no_langs = len(language)
# check if columns for multi-language exists
for lang in language:
doc_lang = "doctor.name_" + lang
if doc_lang not in case_df.columns:
case_df_final = AddColumn(case_df, doc_lang)
### combine translations of masterdata
case_df = case_df.withColumn(
    "doctor",
    F.concat(
        F.col("doctor.name_" + language[0]),
        F.lit(" // "),
        F.col("doctor.name_" + language[1]),
    ),
)
What I would like to achieve is that the new column is built dynamically, depending on the number of languages configured. E.g. if only one language is used, the result would be like this:
case_df = case_df.withColumn(
    "doctor",
    F.col("doctor.name_" + language[0])
)
For two or more languages it should pick all of them, based on their order in the list. Thanks for your help.
I am using Spark 2.4 with Python 3.
The expected output would be the following:
[screenshot of the expected output]

Final working code is the following:
# check if columns for multi-language exist
for lang in language:
    doc_lang = "doctor.name_" + lang
    if doc_lang not in case_df.columns:
        case_df = AddColumn(case_df, doc_lang)
    doc_lang_new = doc_lang.replace(".", "_")
    case_df = case_df.withColumnRenamed(doc_lang, doc_lang_new)

doc_fields = ["doctor_name_" + lang for lang in language]
case_df = case_df.withColumn("doctor", F.concat_ws(" // ", *doc_fields))
Thanks all for the help and hints.
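For what it's worth, F.concat_ws accepts any number of columns, which is why the same line covers the single-language case too. A minimal sketch with made-up sample values (the data below is hypothetical, only the column names match the final code):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# toy frame using the renamed, flattened column names (values made up)
df = spark.createDataFrame(
    [("Tanaka", "Dr. Tanaka")],
    ["doctor_name_JP", "doctor_name_EN"],
)

# one configured language: concat_ws just returns that column's value
one = df.withColumn("doctor", F.concat_ws(" // ", *["doctor_name_JP"]))
print(one.first().doctor)  # -> Tanaka

# two languages: values joined in list order
two = df.withColumn("doctor", F.concat_ws(" // ", *["doctor_name_JP", "doctor_name_EN"]))
print(two.first().doctor)  # -> Tanaka // Dr. Tanaka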

Related

Calculate percentage change in pandas with rows that contain the same values

I am using pandas to calculate the percentage change between values that occur more than once in the column of interest.
I want to compare the values from last week's workout, provided they're the same exercise type, to get the percentage change in weight used and reps accomplished.
I am able to get the percentages for all the rows, which is halfway to what I want, but the conditional part is missing: I only want the percentages when exercise_name has the same value, since the goal is to track improvement on a weekly or bi-weekly basis.
ids = self.user_data["exercise"].fillna(0)
dups = self.user_data[ids.isin(ids[ids.duplicated()])].sort_values("exercise")
dups['exercise'] = dups['exercise'].astype(str)
dups['set_one_weight'] = pd.to_numeric(dups['set_one_weight'])
dups['set_two_weight'] = pd.to_numeric(dups['set_two_weight'])
dups['set_three_weight'] = pd.to_numeric(dups['set_three_weight'])
dups['set_four_weight'] = pd.to_numeric(dups['set_four_weight'])
dups['set_one'] = pd.to_numeric(dups['set_one'])
dups['set_two'] = pd.to_numeric(dups['set_two'])
dups['set_three'] = pd.to_numeric(dups['set_three'])
dups['set_four'] = pd.to_numeric(dups['set_four'])
percent_change = dups[['set_three_weight']].pct_change()
The last line gets the percentage change across all rows for the column set_three_weight, but it can't do what I described above: find the rows with the same exercise name and compute the percentage change within them.
UPDATE
Using a groupby solution
ids = self.user_data["exercise"].fillna(0)
dups = self.user_data[ids.isin(ids[ids.duplicated()])].sort_values("exercise")
dups['exercise'] = dups['exercise'].astype(str)

# convert every weight and rep column to numeric in one pass
weight_cols = ['set_one_weight', 'set_two_weight', 'set_three_weight', 'set_four_weight']
rep_cols = ['set_one', 'set_two', 'set_three', 'set_four']
for col in weight_cols + rep_cols:
    dups[col] = pd.to_numeric(dups[col])
dups['routine_upload_date'] = pd.to_datetime(dups['routine_upload_date'])

# percent_change = dups[['set_three_weight']].pct_change()
# Group the exercises together and create new columns with the percentage delta
dups.sort_values(['exercise', 'routine_upload_date'], inplace=True, ascending=[True, False])
for col in weight_cols:
    dups[col + '_delta'] = dups.groupby('exercise')[col].apply(pd.Series.pct_change) + 1
for col in rep_cols:
    dups[col + '_reps_delta'] = dups.groupby('exercise')[col].apply(pd.Series.pct_change) + 1
print(dups.head())
I think this gets me the result I want; I would like someone to confirm.
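As a sanity check, here is a minimal, self-contained sketch with made-up numbers (the column names are hypothetical) showing that pct_change restarts inside each exercise group, which is exactly the conditional behavior asked for:

import pandas as pd

# toy workout log: two exercises, two weeks each (made-up numbers)
df = pd.DataFrame({
    'exercise': ['squat', 'bench', 'squat', 'bench'],
    'week': [1, 1, 2, 2],
    'set_one_weight': [100.0, 60.0, 110.0, 63.0],
})
df = df.sort_values(['exercise', 'week'])

# pct_change runs within each group: the first row of every
# exercise is NaN, and squat/bench values never get mixed
df['set_one_weight_delta'] = df.groupby('exercise')['set_one_weight'].pct_change() + 1
print(df)
# bench week 2 -> 1.05 (5% heavier), squat week 2 -> 1.10 (10% heavier)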

Using STUFF Function

I have this table, TableA, with the fields [intIdEntidad], [intIdEjercicio], [idTipoGrupoCons]. For idTipoGrupoCons = 16, TableA looks like this:
[screenshot: TableA rows for idTipoGrupoCons = 16]
I'm trying to use the STUFF function to show the column intIdEjercicio separated by commas, something like this:
[screenshot: one row per intIdEntidad with a comma-separated ejercicios column]
This is the query I'm using to try to obtain the result in the image above:
SELECT DISTINCT o.idTipoGrupoCons, o.intIdEntidad, ejercicios = STUFF((
    SELECT ', ' + CONVERT(VARCHAR, a.intIdEjercicio)
    FROM dbo.[tbEntidades_Privadas_InfoAdicionalGrupo] AS a
    WHERE a.idTipoGrupoCons = 16
    FOR XML PATH, TYPE).value(N'.[1]', N'varchar(max)'), 1, 2, '')
FROM [tbEntidades_Privadas_InfoAdicionalGrupo] AS o
JOIN tbEntidades_Privadas p ON o.intIdEntidad = p.intIdEntidad
WHERE o.idTipoGrupoCons = 16
The result isn't correct. To check, I ran this query for idTipoGrupoCons = 16:
SELECT [idTipoGrupoCons], [intIdEntidad], [intIdEjercicio]
FROM [tbEntidades_Privadas_InfoAdicionalGrupo] A
WHERE A.idTipoGrupoCons = 16
The result is this:
[screenshot: the raw rows returned for idTipoGrupoCons = 16]
This means that for intIdEntidad = 50 the only intIdEjercicio is 7, while for intIdEntidad = 45 intIdEjercicio is 2 and 4.
I suppose the problem is that I need to add a condition on intIdEntidad, either in a subquery inside the STUFF call or in the outer WHERE, so that each call to STUFF is correlated with the current row.
I've read about CROSS APPLY and perhaps it can be used to solve the problem.
Here is the answer. The problem was that I needed to join TableA with itself inside the STUFF function, correlated on idTipoGrupoCons and intIdEntidad. In the end the query looks like this:
SELECT t1.idTipoGrupoCons, t1.intIdEntidad,
       ejercicios = STUFF(
           (SELECT ', ' + t3.Ejercicio
            FROM [tbEntidades_Privadas_InfoAdicionalGrupo] t2
            JOIN tbMtoNoRegistro_Ejercicios t3 ON t2.intIdEjercicio = t3.intEjercicio
            WHERE t2.idTipoGrupoCons = t1.idTipoGrupoCons
              AND t2.intIdEntidad = t1.intIdEntidad
            ORDER BY t3.Ejercicio
            FOR XML PATH ('')
           ), 1, 2, '')
FROM [tbEntidades_Privadas_InfoAdicionalGrupo] t1
JOIN tbEntidades_Privadas p ON t1.intIdEntidad = p.intIdEntidad
WHERE t1.idTipoGrupoCons = 17
GROUP BY t1.idTipoGrupoCons, t1.intIdEntidad, p.strDenominacionSocial

Updating multiple line plots dynamically in callback in bokeh

I have a use case with multiple line plots (with legends), and I need to update the line plots based on a column condition. Below is an example with two data sets; based on the country, the column data source changes. The issue I am facing is that the number of columns in the data source is not fixed, and even the types can vary. So when I update the data source in a callback after a new country is selected, I get this error:
Error: attempted to retrieve property array for nonexistent field 'pay_conv_7d.content'.
I am guessing this is because the pay_conv_7d.content column doesn't exist in the new data source, while lines for it were already added to the plot. I have been trying to fix this in various ways (making common columns for all country selections, adding the missing columns to the data source in the callback), but I still get issues.
Is there any clean way to update multiple line plots from a callback, without a lot of hackish workarounds? Any insights or help would be really appreciated. Thanks much in advance! :)
import numpy as np
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral11

def setup_multiline_plots(x_axis, y_axis, title_text, data_source, plot):
    num_categories = len(data_source.data['categories'])
    legends_list = list(data_source.data['categories'])
    colors_list = Spectral11[0:num_categories]
    # xs = [data_source.data['%s.' % x_axis].values] * num_categories
    # ys = [data_source.data['%s.%s' % (y_axis, column)] for column in data_source.data['categories']]
    # data_source.data['x_series'] = xs
    # data_source.data['y_series'] = ys
    # plot.multi_line('x_series', 'y_series', line_color=colors_list, legend='categories', line_width=3, source=data_source)
    plot_list = []
    for (colr, leg, column) in zip(colors_list, legends_list, data_source.data['categories']):
        xs, ys = '%s.' % x_axis, '%s.%s' % (y_axis, column)
        plot.line(xs, ys, source=data_source, color=colr, legend=leg, line_width=3, name=ys)
        plot_list.append(ys)
    data_source.data['plot_names'] = data_source.data.get('plot_names', []) + plot_list
    plot.title.text = title_text

def update_plot(country, timeseries_df, timeseries_source,
                aggregate_df, aggregate_source, category,
                plot_pay_7d, plot_r_pay_90d):
    aggregate_metrics = aggregate_df.loc[aggregate_df.country == country]
    aggregate_metrics = aggregate_metrics.nlargest(10, 'cost')
    category_types = list(aggregate_metrics[category].unique())
    timeseries_df = timeseries_df[timeseries_df[category].isin(category_types)]
    timeseries_multi_line_metrics = get_multiline_column_datasource(timeseries_df, category, country)
    # len_series = len(timeseries_multi_line_metrics.data['time.'])
    # previous_legends = timeseries_source.data['plot_names']
    # current_legends = timeseries_multi_line_metrics.data.keys()
    # common_legends = list(set(previous_legends) & set(current_legends))
    # additional_legends_list = list(set(previous_legends) - set(current_legends))
    # for legend in additional_legends_list:
    #     zeros = pd.Series(np.array([0] * len_series), name=legend)
    #     timeseries_multi_line_metrics.add(zeros, legend)
    # timeseries_multi_line_metrics.data['plot_names'] = previous_legends
    timeseries_source.data = timeseries_multi_line_metrics.data
    aggregate_source.data = aggregate_source.from_df(aggregate_metrics)

def get_multiline_column_datasource(df, category, country):
    df_country = df[df.country == country]
    df_pivoted = pd.DataFrame(df_country.pivot_table(index='time', columns=category, aggfunc=np.sum).reset_index())
    df_pivoted.columns = df_pivoted.columns.to_series().str.join('.')
    categories = list(set([column.split('.')[1] for column in list(df_pivoted.columns)]))[1:]
    data_source = ColumnDataSource(df_pivoted)
    data_source.data['categories'] = categories
    return data_source  # update_plot above reads .data from this
Recently I had to update the data on a MultiLine glyph; check my question if you want to take a look at my algorithm.
I think you can update a ColumnDataSource in at least three ways:
You can create a dataframe to instantiate a new CDS:
cds = ColumnDataSource(df_pivoted)
data_source.data = cds.data
You can create a dictionary and assign it to the data attribute directly
d = {
    'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
    'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
    'xs1': [[17.0, 166.0], [17.0, 116.0], [17.0, 126.0]],
    'ys1': [[179.0, 169.0], [179.0, 1169.0], [1729.0, 169.0]],
    'xs2': [[27.0, 276.0], [27.0, 216.0], [27.0, 226.0]],
    'ys2': [[279.0, 269.0], [279.0, 2619.0], [2579.0, 2569.0]]
}
data_source.data = d
If you need columns of different sizes, or empty columns, you can fill the gaps with NaN values in order to keep all the column sizes equal. I think this is the solution to your question:
import numpy as np
d = {
    'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
    'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
    'xs1': [[17.0, 166.0], [np.nan], [np.nan]],
    'ys1': [[179.0, 169.0], [np.nan], [np.nan]],
    'xs2': [[np.nan], [np.nan], [np.nan]],
    'ys2': [[np.nan], [np.nan], [np.nan]]
}
data_source.data = d
Or, if you only need to modify a few values, you can use the patch method. Check the documentation here.
The following example shows how to patch entire column elements:
source = ColumnDataSource(data=dict(foo=[10, 20, 30], bar=[100, 200, 300]))
patches = {
    'foo': [(slice(2), [11, 12])],
    'bar': [(0, 101), (2, 301)],
}
source.patch(patches)
After this operation, the value of source.data will be:
dict(foo=[11, 12, 30], bar=[101, 200, 301])
NOTE: It is important to make the update in one go to avoid performance issues
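Tying that back to the question above: a small helper can pad the new country's dict with NaN columns before the single .data assignment, so every column the line glyphs were created with still exists. This is only a sketch; the name pad_missing_columns and the commented usage lines are my own, not Bokeh API:

import numpy as np

def pad_missing_columns(new_data, expected_columns):
    # length of any existing column, so the NaN filler lines up
    n = len(next(iter(new_data.values()), []))
    for col in expected_columns:
        if col not in new_data:
            new_data[col] = [np.nan] * n  # column stays alive, nothing is drawn
    return new_data

# hypothetical use inside update_plot, before the assignment:
# new_data = pad_missing_columns(dict(timeseries_multi_line_metrics.data),
#                                timeseries_source.data['plot_names'])
# timeseries_source.data = new_data  # one assignment, one redraw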

Python 3, create new dictionary using list and another dictionary

Could you help me with my issue? Let's say that I have a few lists with the IDs of their members, like below:
team_A = [1,2,3,4,5]
team_B = [6,7,8,9,10]
team_C = [11,12,13,14,15]
and now I have a dictionary with their values:
dictionary = {5:23, 10:68, 15:68, 4:1, 9:37, 14:21, 3:987, 8:3, 13:14, 2:98, 7:74, 12:47, 1:37, 6:82, 11:99}
I would like to take the right elements from the dictionary and create a new dictionary for each of teams A, B and C, like below:
team_A_values = {5:23, 4:1, 3:987, 2:98, 1:37}
Could you give advice on how to do that? Thanks for your help.
You can do something like below by just iterating through the lists:
team_A = [1,2,3,4,5]
team_B = [6,7,8,9,10]
team_C = [11,12,13,14,15]
dictionary = {5:23, 10:68, 15:68, 4:1, 9:37, 14:21, 3:987, 8:3, 13:14, 2:98, 7:74, 12:47, 1:37, 6:82, 11:99}
team_A_values = {}
for i in team_A:
    team_A_values[i] = dictionary[i]
print(team_A_values)
You can repeat this for team B and team C.
In that case you can do it like this, building all three at once:
team_values = [
    {i: dictionary[i] for i in team_A},
    {i: dictionary[i] for i in team_B},
    {i: dictionary[i] for i in team_C},
]
teamA, teamB, teamC = team_values
print(team_values)
print(teamA)
print(teamB)
print(teamC)
In one line you can do it like this:
team_values = [{i: dictionary[i] for i in team} for team in [team_A, team_B, team_C]]
teamA, teamB, teamC = team_values
print(team_values)
print(teamA)
print(teamB)
print(teamC)
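If the number of teams can grow, a small variation of the same comprehension, keyed by team name, avoids relying on positional unpacking. A sketch (the teams mapping is hypothetical):

teams = {'A': team_A, 'B': team_B, 'C': team_C}
team_values = {name: {i: dictionary[i] for i in members}
               for name, members in teams.items()}
print(team_values['A'])  # {1: 37, 2: 98, 3: 987, 4: 1, 5: 23}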

Split a string on sqlite

I don't know SQLite, but I have to work with a database that has already been built. I'm programming with the Corona SDK. The problem: I have a column called "answers" in this format: House,40|Bed,20|Mirror,10, etc.
I want to split the string on "," and "|" to get values like this:
VARIABLE A=House
VARIABLE A1=40
VARIABLE B=Bed
VARIABLE B1=20
VARIABLE C=Mirror
VARIABLE C1=10
I'm sorry for my English. Thanks to everybody.
Try this. If you simply want to split out the values and drop the "," and "|" characters, you can use the following (update 3 of this answer):
local myString = "House;home;flat,40|Bed;bunk,20|Mirror,10"
local myTable = {}
local tempTable = {}
local count_1 = 0
for word in string.gmatch(myString, "([^,|]+)") do
    myTable[#myTable+1] = word
    count_1 = count_1 + 1
    tempTable[count_1] = {}  -- multi-dimensional array
    local count_2 = 0
    for word_ in string.gmatch(myTable[#myTable], "([^,|,;]+)") do
        count_2 = count_2 + 1
        tempTable[count_1][count_2] = word_
        --print(count_1.."|"..count_2.."|"..word_)
    end
end
print("------------------------")
local myTable = {}  -- resetting myTable, just to use it again :)
for i = 1, count_1 do
    for j = 1, #tempTable[i] do
        print("tempTable["..i.."]["..j.."] = "..tempTable[i][j])
        if (j == 1) then myTable[i] = tempTable[i][j] end
    end
end
print("------------------------")
for i = 1, #myTable do
    print("myTable["..i.."] = "..myTable[i])
end
--[[ So now you will have a multidimensional array tempTable with
     elements as:
         tempTable = {{House,home,flat},
                      {40},
                      {Bed,bunk},
                      {20},
                      {Mirror},
                      {10}}
     So you can simply take any random/desired value from each.
     I am taking any of the 3 from the string "House;home;flat" and
     assigning it to var1 below:
--]]
var1 = tempTable[1][math.random(3)]
print("var1 = "..var1)
-- So, as per your need, you can check var1 like this:
for i = 1, #tempTable[1] do  -- #tempTable[1] is the element count of tempTable[1]
    if (var1 == tempTable[1][i]) then
        print("Ok")
        break
    end
end
----------------------------------------------------------------
-- Here you can print myTable (if needed):
----------------------------------------------------------------
for i = 1, #myTable do
    print("myTable["..i.."] = "..myTable[i])
end
--[[ The output is as follows:
     myTable[1] = House
     myTable[2] = 40
     myTable[3] = Bed
     myTable[4] = 20
     myTable[5] = Mirror
     myTable[6] = 10
     Is this what you are looking for?
--]]
----------------------------------------------------------------
Keep coding............. :)
