How can I format the data below into tabular form using Python?
Is there any way to print/write the data in the expected format?
[{"itemcode":null,"productname":"PKS543452","value_2018":null},
{"itemcode":null,"productname":"JHBG6%&9","value_2018":null},
{"itemcode":null,"productname":"VATER3456","value_2018":null},
{"itemcode":null,"productname":"ACDFER3434","value_2018":null}]
Expected output:
|itemcode | Productname | Value_2018 |
|null |PKS543452|null|
|null |JHBG6%&9|null|
|null |VATER3456|null|
|null |ACDFER3434|null|
You can use pandas to generate a dataframe from the list of dictionaries:
import pandas as pd
null = "null"
lst = [{"itemcode":null,"productname":"PKS543452","value_2018":null},
{"itemcode":null,"productname":"JHBG6%&9","value_2018":null},
{"itemcode":null,"productname":"VATER3456","value_2018":null},
{"itemcode":null,"productname":"ACDFER3434","value_2018":null}]
df = pd.DataFrame.from_dict(lst)
print(df)
Output:
itemcode productname value_2018
0 null PKS543452 null
1 null JHBG6%&9 null
2 null VATER3456 null
3 null ACDFER3434 null
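If you also want the pipe-delimited layout from the expected output rather than the default DataFrame print, pandas can get you close directly; a minimal sketch using the df built above (to_markdown is available from pandas 1.0 and needs the optional tabulate package, and neither variant matches the expected format character for character):
# Markdown-style table with leading/trailing pipes (requires `tabulate`)
print(df.to_markdown(index=False))

# Or, with no extra dependency, pipe-separated values
print(df.to_csv(sep='|', index=False))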
Working with a DataFrame also makes it easy to manipulate the data later on. Otherwise, you can print your desired output using built-in string methods:
col_names = '|' + ' | '.join(lst[0].keys()) + '|'
print(col_names)
for dic in lst:
    row = '|' + ' | '.join(dic.values()) + '|'
    print(row)
Output:
|itemcode | productname | value_2018|
|null | PKS543452 | null|
|null | JHBG6%&9 | null|
|null | VATER3456 | null|
|null | ACDFER3434 | null|
You can try it like this as well (without using pandas). Every line in the code below is commented, so don't forget to read the comments.
Note: the list/array you have pasted is either the result of json.dumps() (in Python, text) or a copied API response (JSON).
null comes from JavaScript, so the pasted list/array is not a valid Python list; it can, however, be treated as text and converted back to a Python list with json.loads(), in which case null becomes None.
That is why, to produce the wanted output, we need an extra check like "null" if d[key] is None else d[key].
import json
# `null` comes from JavaScript/JSON, so the pasted data is kept here as a JSON text string
json_text = """[{"itemcode":null,"productname":"PKS543452","value_2018":null},
{"itemcode":null,"productname":"JHBG6%&9","value_2018":null},
{"itemcode":null,"productname":"VATER3456","value_2018":null},
{"itemcode":null,"productname":"ACDFER3434","value_2018":null}]"""
# Will contain the rows (text)
texts = []
# Convert back to a Python list object; `null` (JavaScript) becomes `None` (Python)
l = json.loads(json_text)
# Obtain the keys once (before Python 3.7, dictionaries do not guarantee order),
# so it is important to iterate over the same keys, in the same order, for every dictionary in the list.
# A column may end up in a different position, but the related data will still line up.
# If you wish you can hard-code the `keys`; here I get them from `l[0].keys()`
keys = l[0].keys()
# Form header and add to `texts` list
header = '|' + ' | '.join(keys) + " |"
texts.append(header)
# Form body (rows) and append to `texts` list
rows = ['| ' + "|".join(["null" if d[key] is None else d[key] for key in keys]) + "|" for d in l]
texts.extend(rows)
# Print all rows (including header) separated by newline '\n'
answer = '\n'.join(texts)
print(answer)
Output:
|itemcode | productname | value_2018 |
| null|PKS543452|null|
| null|JHBG6%&9|null|
| null|VATER3456|null|
| null|ACDFER3434|null|
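If you additionally want the columns padded so that the pipes line up, here is a small variation of the same idea (plain Python, reusing the l and keys from the snippet above; the padding scheme is just one possible choice):
# Display width per column: the longer of the header and the longest value
values = [{k: ("null" if d[k] is None else str(d[k])) for k in keys} for d in l]
widths = {k: max(len(k), max(len(v[k]) for v in values)) for k in keys}
# Header and rows, each cell left-justified to its column width
lines = ['| ' + ' | '.join(k.ljust(widths[k]) for k in keys) + ' |']
for v in values:
    lines.append('| ' + ' | '.join(v[k].ljust(widths[k]) for k in keys) + ' |')
print('\n'.join(lines))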
Related
I have strings as below, where I need to select only the last part of the string using PySpark.
Input
/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_
/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_
Output
fsp_store_abcxyz_lmn_
fsp_store_schu_lev_bsd_s_
I am using the code below:
from pyspark.sql.functions import *
df=spark.sql("""select stack(2,"/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_","/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_") as (txt)""")
df.withColumn("extract",regexp_extract(col("txt"),"_(.*)",1)).display(10,False)
and my output is
store_abcxyz_lmn
store_schu_lev_bsd_s
However, my requirement is:
fsp_store_schu_lev_bsd_s_
fsp_store_abcxyz_lmn_
Could you please help with the above challenge?
Try this:
import pyspark.sql.functions as F

df = spark.createDataFrame([{"Path": "/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_"},
                            {"Path": "/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_ "}])
df.show(2, False)
+------------------------------------------------------+
|Path |
+------------------------------------------------------+
|/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_ |
|/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_ |
+------------------------------------------------------+
df.withColumn("result", F.reverse(F.split(F.reverse("Path"), "/")[0])).show(2, False)
+------------------------------------------------------+--------------------------+
|Path |result |
+------------------------------------------------------+--------------------------+
|/dbfs/mnt/abc/date=20210224/fsp_store_abcxyz_lmn_ |fsp_store_abcxyz_lmn_ |
|/dbfs/mnt/abc/date=20210224/fsp_store_schu_lev_bsd_s_ |fsp_store_schu_lev_bsd_s_ |
+------------------------------------------------------+--------------------------+
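If you prefer to avoid the double reverse, the same "last path element" extraction can also be written with substring_index or element_at; a sketch on the same df (both functions live in pyspark.sql.functions, element_at needs Spark 2.4+):
import pyspark.sql.functions as F

# Everything after the last '/'
df.withColumn("result", F.substring_index("Path", "/", -1)).show(2, False)

# Or the last element of the split array
df.withColumn("result", F.element_at(F.split("Path", "/"), -1)).show(2, False)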
I have a PySpark data frame with a string column(URL) and all records look in the following way
ID URL
1 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189
I basically want to extract the number after conversations/ from the URL column into another column, using regex.
I tried the following code but it doesn't give me any results.
df1 = df.withColumn('CONV_ID', split(convo_influ_new['URL'], '(?<=conversations/).*').getItem(0))
Expected:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 2938419189
Result:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 https://app.drift.com/inboxes/136636/conversations/2938419189
Not sure what's happening here. I tried the regex in different online regex tester tools and it highlights the part I want, but it never works in PySpark. I tried different PySpark functions like f.split, regexp_extract, regexp_replace, but none of them work.
If your URLs always have that form, you can actually just use substring_index to get the last path element:
import pyspark.sql.functions as F
df1 = df.withColumn("CONV_ID", F.substring_index("URL", "/", -1))
df1.show(truncate=False)
#+---+-------------------------------------------------------------+----------+
#|ID |URL |CONV_ID |
#+---+-------------------------------------------------------------+----------+
#|1 |https://app.xyz.com/inboxes/136636/conversations/2686735685 |2686735685|
#|2 |https://app.xyz.com/inboxes/136636/conversations/2938415796 |2938415796|
#|3 |https://app.drift.com/inboxes/136636/conversations/2938419189|2938419189|
#+---+-------------------------------------------------------------+----------+
You can use regexp_extract instead:
import pyspark.sql.functions as F
df1 = df.withColumn(
'CONV_ID',
F.regexp_extract('URL', 'conversations/(.*)', 1)
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+
Or if you want to use split, you don't need to specify .*. You just need to specify the pattern used for splitting.
import pyspark.sql.functions as F
df1 = df.withColumn(
'CONV_ID',
F.split('URL', '(?<=conversations/)')[1] # just using 'conversations/' should also be enough
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+
I have a text. From each line I want to filter out everything after some stop word. For example:
stop_words=['with','is', '/']
One of the rows is:
senior manager with experience
I want to remove everything after with (including with) so the output is:
senior manager
I have big data and am working with Spark in Python.
You can find the location of the stop words using instr, and get a substring up to that location.
import pyspark.sql.functions as F
stop_words = ['with', 'is', '/']
df = spark.createDataFrame([
    ['senior manager with experience'],
    ['is good'],
    ['xxx //'],
    ['other text']
]).toDF('col')
df.show(truncate=False)
+------------------------------+
|col |
+------------------------------+
|senior manager with experience|
|is good |
|xxx // |
|other text |
+------------------------------+
df2 = df.withColumn('idx',
    F.coalesce(
        # Get the smallest index of a stop word in the string
        F.least(*[F.when(F.instr('col', s) != 0, F.instr('col', s)) for s in stop_words]),
        # If no stop words found, get the whole string
        F.length('col') + 1)
).selectExpr('trim(substring(col, 1, idx-1)) col')
df2.show()
+--------------+
| col|
+--------------+
|senior manager|
| |
| xxx|
| other text|
+--------------+
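A non-UDF alternative is to build a single regular expression from the stop words and cut everything from the first match onward with regexp_replace; this is a sketch of a different approach than the instr one above (same df and stop_words assumed, with re.escape handling the '/'):
import re
import pyspark.sql.functions as F

stop_words = ['with', 'is', '/']
# '(with|is|/).*' : from the first stop word to the end of the string
pattern = '(' + '|'.join(re.escape(s) for s in stop_words) + ').*'
df2 = df.withColumn('col', F.trim(F.regexp_replace('col', pattern, '')))
df2.show()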
You can use a udf to get the index of the first occurrence of a stop word in col, and then another udf to take the substring of col up to that index (this answer is in Scala; a rough Python equivalent is sketched after the output below).
val stop_words = List("with", "is", "/")
val df = List("senior manager with experience", "is good", "xxx//", "other text").toDF("col")
val index_udf = udf((col_value: String) => {
  val result = for (elem <- stop_words; if col_value.contains(elem)) yield col_value.indexOf(elem)
  if (result.isEmpty) col_value.length else result.min
})
val substr_udf = udf((elem: String, index: Int) => elem.substring(0, index))
val df3 = df.withColumn("index", index_udf($"col")).withColumn("substr_message", substr_udf($"col", $"index"))
  .select($"substr_message").withColumnRenamed("substr_message", "col")
df3.show()
+---------------+
| col|
+---------------+
|senior manager |
| |
| xxx|
| other text|
+---------------+
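If you are working in Python rather than Scala, the same two-UDF idea can be sketched with PySpark's udf (a rough equivalent, assuming the df and stop_words from the first answer; the function names are my own):
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, StringType

stop_words = ['with', 'is', '/']

# Index of the first stop word found in the string, or the full length if none is found
@F.udf(IntegerType())
def first_stop_index(col_value):
    positions = [col_value.index(s) for s in stop_words if s in col_value]
    return min(positions) if positions else len(col_value)

# Substring up to the computed index
@F.udf(StringType())
def cut_at(col_value, idx):
    return col_value[:idx]

df3 = (df.withColumn('index', first_stop_index('col'))
         .withColumn('col', cut_at('col', 'index'))
         .select('col'))
df3.show()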
I'm having trouble splitting a dataframe's column into more columns in PySpark:
I have a list of lists and I want to transform it into a dataframe, each value in one column.
What I have tried:
I created a dataframe from this list:
[['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754'],
['DLL-7760', 'NAT-9885', 'PED-0550', 'MAR-0004', 'LLL-5554']]
Using this code:
from pyspark.sql import Row
R = Row('col1', 'col2')
# use enumerate to add the ID column
df_from_list = spark.createDataFrame([R(i, x) for i, x in enumerate(recs_list)])
The result I got is:
+----+--------------------+
|col1| col2|
+----+--------------------+
| 0|[COL-4560, COL-96...|
| 1|[DLL-7760, NAT-98...|
+----+--------------------+
I want to separate the values by comma into columns, so I tried:
from pyspark.sql import functions as F
df2 = df_from_list.select('col1', F.split('col2', ', ').alias('col2'))
# If you don't know the number of columns:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]
df_result = df2.select('col1', *[df2['col2'][i] for i in range(nb_columns)])
df_result.show()
But I get an error on this line df2 = df_from_list.select('col1', F.split('col2', ', ').alias('col2')):
AnalysisException: cannot resolve 'split(`col2`, ', ', -1)' due to data type mismatch: argument 1 requires string type, however, '`col2`' is of array<string> type.;;
My ideal final output would be like this:
+----------+----------+----------+----------+----------+
|   SKU    |  REC_01  |  REC_02  |  REC_03  |  REC_04  |
+----------+----------+----------+----------+----------+
| COL-4560 | COL-9655 | NWG-0610 | D81-3754 | null     |
| DLL-7760 | NAT-9885 | PED-0550 | MAR-0004 | LLL-5554 |
+----------+----------+----------+----------+----------+
Some rows may have four values, but some may have more or fewer; I don't know the exact number of columns the final dataframe will have.
Does anyone have any idea of what is happening? Thank you very much in advance.
The col2 column of the df_from_list dataframe is already of array type, so there is no need to split (split works on string type, but here we have array type).
Here are the steps that will work for you.
recs_list = [['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754'],
             ['DLL-7760', 'NAT-9885', 'PED-0550', 'MAR-0004', 'LLL-5554']]
from pyspark.sql import Row
R = Row('col1', 'col2')
# use enumerate to add the ID column
df_from_list = spark.createDataFrame([R(i, x) for i, x in enumerate(recs_list)])
from pyspark.sql import functions as F
df2 = df_from_list
# If you don't know the number of columns:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]
cols=['SKU','REC_01','REC_02','REC_03','REC_04']
df_result = df2.select(*[df2['col2'][i] for i in range(nb_columns)]).toDF(*cols)
df_result.show()
#+--------+--------+--------+--------+--------+
#| SKU| REC_01| REC_02| REC_03| REC_04|
#+--------+--------+--------+--------+--------+
#|COL-4560|COL-9655|NWG-0610|D81-3754| null|
#|DLL-7760|NAT-9885|PED-0550|MAR-0004|LLL-5554|
#+--------+--------+--------+--------+--------+
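Since you mentioned you don't know the number of columns in advance, you can also generate the column names instead of hard-coding them; a small variation on the snippet above (the SKU/REC_ naming is just an assumption about how you want the columns called):
# First array element becomes SKU, the rest REC_01, REC_02, ...
cols = ['SKU'] + ['REC_{:02d}'.format(i) for i in range(1, nb_columns)]
df_result = df2.select(*[df2['col2'][i] for i in range(nb_columns)]).toDF(*cols)
df_result.show()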
I have the following dataframe in pyspark:
+------------------------------------------------------------+
|probability |
+------------------------------------------------------------+
|[0.27047928569511825,0.5312608102025099,0.19825990410237174]|
|[0.06711381377029987,0.8775456658890036,0.05534052034069637]|
|[0.10847074295048188,0.04602848157663474,0.8455007754728833]|
+------------------------------------------------------------+
and I want to get the largest and second-largest values and their indexes:
+-------------------------------------------------------------+------------------+-------+-------------------+-------+
|probability                                                  |largest_1         |index_1|largest_2          |index_2|
+-------------------------------------------------------------+------------------+-------+-------------------+-------+
|[0.27047928569511825,0.5312608102025099,0.19825990410237174] |0.5312608102025099|1      |0.27047928569511825|0      |
|[0.06711381377029987,0.8775456658890036,0.05534052034069637] |0.8775456658890036|1      |0.06711381377029987|0      |
|[0.10847074295048188,0.04602848157663474,0.8455007754728833] |0.8455007754728833|2      |0.10847074295048188|0      |
+-------------------------------------------------------------+------------------+-------+-------------------+-------+
Here is another way using transform (requires Spark 2.4+) to convert the array of doubles into an array of structs containing the value and index of each item in the original array, sort_array it in descending order, and then take the first two:
from pyspark.sql.functions import expr

df.withColumn('d', expr('sort_array(transform(probability, (x,i) -> (x as val, i as idx)), False)')) \
  .selectExpr(
      'probability',
      'd[0].val as largest_1',
      'd[0].idx as index_1',
      'd[1].val as largest_2',
      'd[1].idx as index_2'
  ).show(truncate=False)
+--------------------------------------------------------------+------------------+-------+-------------------+-------+
|probability |largest_1 |index_1|largest_2 |index_2|
+--------------------------------------------------------------+------------------+-------+-------------------+-------+
|[0.27047928569511825, 0.5312608102025099, 0.19825990410237174]|0.5312608102025099|1 |0.27047928569511825|0 |
|[0.06711381377029987, 0.8775456658890036, 0.05534052034069637]|0.8775456658890036|1 |0.06711381377029987|0 |
|[0.10847074295048188, 0.04602848157663474, 0.8455007754728833]|0.8455007754728833|2 |0.10847074295048188|0 |
+--------------------------------------------------------------+------------------+-------+-------------------+-------+
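If you are on Spark 3.1+ and prefer the DataFrame API over a SQL expression string, roughly the same idea can be written with pyspark.sql.functions.transform and sort_array; a sketch, not the original answer's code:
from pyspark.sql import functions as F

# Pair each value with its index, then sort the (val, idx) structs descending by value
d = F.sort_array(
    F.transform("probability", lambda x, i: F.struct(x.alias("val"), i.alias("idx"))),
    asc=False,
)

df.select(
    "probability",
    d[0]["val"].alias("largest_1"),
    d[0]["idx"].alias("index_1"),
    d[1]["val"].alias("largest_2"),
    d[1]["idx"].alias("index_2"),
).show(truncate=False)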
From Spark 2.4+ you can use the array_sort and array_position built-in functions for this case.
Example:
df=spark.sql("select array(0.27047928569511825,0.5312608102025099,0.19825990410237174) probability union select array(0.06711381377029987,0.8775456658890036,0.05534052034069637) prbability union select array(0.10847074295048188,0.04602848157663474,0.8455007754728833) probability")
#DataFrame[probability: array<decimal(17,17)>]
#sample data
df.show(10,False)
#+---------------------------------------------------------------+
#|probability |
#+---------------------------------------------------------------+
#|[0.06711381377029987, 0.87754566588900360, 0.05534052034069637]|
#|[0.27047928569511825, 0.53126081020250990, 0.19825990410237174]|
#|[0.10847074295048188, 0.04602848157663474, 0.84550077547288330]|
#+---------------------------------------------------------------+
df.withColumn("sort_arr",array_sort(col("probability"))).\
withColumn("largest_1",element_at(col("sort_arr"),-1)).\
withColumn("largest_2",element_at(col("sort_arr"),-2)).\
selectExpr("*","array_position(probability,largest_1) -1 index_1","array_position(probability,largest_2) -1 index_2").\
drop("sort_arr").\
show(10,False)
#+---------------------------------------------------------------+-------------------+-------------------+-------+-------+
#|probability |largest_1 |largest_2 |index_1|index_2|
#+---------------------------------------------------------------+-------------------+-------------------+-------+-------+
#|[0.06711381377029987, 0.87754566588900360, 0.05534052034069637]|0.87754566588900360|0.06711381377029987|1 |0 |
#|[0.27047928569511825, 0.53126081020250990, 0.19825990410237174]|0.53126081020250990|0.27047928569511825|1 |0 |
#|[0.10847074295048188, 0.04602848157663474, 0.84550077547288330]|0.84550077547288330|0.10847074295048188|2 |0 |
#+---------------------------------------------------------------+-------------------+-------------------+-------+-------+