pyspark only escapes the first quote after a backslash from csv - apache-spark

I have CSV data in which I've escaped the existing backslash characters with another backslash:
year,content
2021,"\\"foo\\",bar"
I'd like to read it using Spark and display the data. The expected data in the dataframe:
+------+--------------+
| year | content      |
+------+--------------+
| 2021 | \"foo\",bar  |
+------+--------------+
But when I ran this:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType(
    [
        StructField("year", IntegerType(), False),
        StructField("content", StringType(), False),
    ]
)
df = spark.read.csv(
    "s3://path/to/csv",
    schema=schema,
    header=True,
)
df.show(20, False)
I'm getting this:
+------+--------------+
| year | content      |
+------+--------------+
| 2021 | "\"foo\\"    |
+------+--------------+
Any idea how to handle this properly?

If you can save your data like this (single escape instead of double escape)
year,content
2021,"\"foo\", bar"
then you can read it as you're doing and the output would be
# +----+----------+
# |year|content |
# +----+----------+
# |2021|"foo", bar|
# +----+----------+
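For reference, here is a minimal sketch of that read, reusing the placeholder S3 path and the spark session from the question. Spark's CSV reader already uses a backslash as its default escape character, so the \" sequences in the single-escaped file come back as plain double quotes; passing escape explicitly only makes that assumption visible:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("year", IntegerType(), False),
    StructField("content", StringType(), False),
])

# escape="\\" (a single backslash) matches the reader's default, so the \" sequences
# inside the quoted field are unescaped into plain double quotes.
df = spark.read.csv("s3://path/to/csv", schema=schema, header=True, escape="\\")
df.show(20, False)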

Related

Parse `key1=value1 key2=value2` in Kusto

I'm running Cilium inside an Azure Kubernetes cluster and want to parse the Cilium log messages in Azure Log Analytics. The log messages have a format like:
key1=value1 key2=value2 key3="if the value contains spaces, it's wrapped in quotation marks"
For example:
level=info msg="Identity of endpoint changed" containerID=a4566a3e5f datapathPolicyRevision=0
I couldn't find a matching parse_xxx function in the docs (e.g. https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/parsecsvfunction). Is there a way to write a custom function to parse this kind of log message?
Not a fun format to parse... But this should work:
let LogLine = "level=info msg=\"Identity of endpoint changed\" containerID=a4566a3e5f datapathPolicyRevision=0";
print LogLine
| extend KeyValuePairs = array_concat(
    extract_all("([a-zA-Z_]+)=([a-zA-Z0-9_]+)", LogLine),
    extract_all("([a-zA-Z_]+)=\"([a-zA-Z0-9_ ]+)\"", LogLine))
| mv-apply KeyValuePairs on
(
    extend p = pack(tostring(KeyValuePairs[0]), tostring(KeyValuePairs[1]))
    | summarize dict=make_bag(p)
)
The output will be:
| print_0            | dict                                                                                                                  |
|--------------------|-----------------------------------------------------------------------------------------------------------------------|
| level=info msg=... | {"level": "info", "containerID": "a4566a3e5f", "datapathPolicyRevision": "0", "msg": "Identity of endpoint changed"}   |
With the help of Slavik N, I came up with a query that works for me:
let containerIds = KubePodInventory
    | where Namespace startswith "cilium"
    | distinct ContainerID
    | summarize make_set(ContainerID);
ContainerLog
| where ContainerID in (containerIds)
| extend KeyValuePairs = array_concat(
    extract_all("([a-zA-Z0-9_-]+)=([^ \"]+)", LogEntry),
    extract_all("([a-zA-Z0-9_]+)=\"([^\"]+)\"", LogEntry))
| mv-apply KeyValuePairs on
(
    extend p = pack(tostring(KeyValuePairs[0]), tostring(KeyValuePairs[1]))
    | summarize JSONKeyValuePairs=parse_json(make_bag(p))
)
| project TimeGenerated, Level=JSONKeyValuePairs.level, Message=JSONKeyValuePairs.msg, PodName=JSONKeyValuePairs.k8sPodName, Reason=JSONKeyValuePairs.reason, Controller=JSONKeyValuePairs.controller, ContainerID=JSONKeyValuePairs.containerID, Labels=JSONKeyValuePairs.labels, Raw=LogEntry
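As a quick sanity check of the two regex patterns outside of Kusto, here is a rough Python equivalent of the same extract-then-merge idea (illustrative only, not part of the Log Analytics query; the sample line is the one from the question):
import re

log_line = 'level=info msg="Identity of endpoint changed" containerID=a4566a3e5f datapathPolicyRevision=0'

# Unquoted key=value pairs first, then quoted ones, mirroring the two extract_all calls above.
pairs = re.findall(r'([a-zA-Z0-9_-]+)=([^ "]+)', log_line)
pairs += re.findall(r'([a-zA-Z0-9_]+)="([^"]+)"', log_line)

print(dict(pairs))
# {'level': 'info', 'containerID': 'a4566a3e5f', 'datapathPolicyRevision': '0',
#  'msg': 'Identity of endpoint changed'}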

invalid string interpolation: `$$', `$'ident or `$'BlockExpr expected -> Spark SQL

The error I am getting:
invalid string interpolation: `$$', `$'ident or `$'BlockExpr expected
Spark SQL:
val sql =
  s"""
     |SELECT
     |  CAC.engine
     |  ,CAC.user_email
     |  ,CAC.submit_time
     |  ,CAC.end_time
     |  ,CAC.duration
     |  ,CAC.counter_name
     |  ,CAC.counter_value
     |  ,CAC.usage_hour
     |  ,CAC.event_date
     |FROM
     |  xyz.command AS CAC
     |  INNER JOIN
     |  (
     |    SELECT DISTINCT replace(split(get_json_object(metadata_payload, '$.configuration.name'), '_')[1], 'acc', '') AS account_id
     |    FROM xyz.metadata
     |  ) AS QCM
     |  ON QCM.account_id = CAC.account_id
     |WHERE
     |  CAC.event_date BETWEEN '2019-10-01' AND '2019-10-05'
     |""".stripMargin
val df = spark.sql(sql)
df.show(10, false)
You added the s prefix, which means you want the string to be interpolated: every token prefixed with $ is replaced by the local variable of the same name. From your code it looks like you don't use this feature, so you can simply remove the s prefix from the string:
val sql =
  """
     |SELECT
     |  CAC.engine
     |  ,CAC.user_email
     |  ,CAC.submit_time
     |  ,CAC.end_time
     |  ,CAC.duration
     |  ,CAC.counter_name
     |  ,CAC.counter_value
     |  ,CAC.usage_hour
     |  ,CAC.event_date
     |FROM
     |  xyz.command AS CAC
     |  INNER JOIN
     |  (
     |    SELECT DISTINCT replace(split(get_json_object(metadata_payload, '$.configuration.name'), '_')[1], 'acc', '') AS account_id
     |    FROM xyz.metadata
     |  ) AS QCM
     |  ON QCM.account_id = CAC.account_id
     |WHERE
     |  CAC.event_date BETWEEN '2019-10-01' AND '2019-10-05'
     |""".stripMargin
Otherwise, if you really need the interpolation, you have to escape the $ sign by doubling it, like this:
val sql =
  s"""
     |SELECT
     |  CAC.engine
     |  ,CAC.user_email
     |  ,CAC.submit_time
     |  ,CAC.end_time
     |  ,CAC.duration
     |  ,CAC.counter_name
     |  ,CAC.counter_value
     |  ,CAC.usage_hour
     |  ,CAC.event_date
     |FROM
     |  xyz.command AS CAC
     |  INNER JOIN
     |  (
     |    SELECT DISTINCT replace(split(get_json_object(metadata_payload, '$$.configuration.name'), '_')[1], 'acc', '') AS account_id
     |    FROM xyz.metadata
     |  ) AS QCM
     |  ON QCM.account_id = CAC.account_id
     |WHERE
     |  CAC.event_date BETWEEN '2019-10-01' AND '2019-10-05'
     |""".stripMargin

How to convert a SAP .txt extraction into a .csv file

I have a .txt file as in the example reported below. I would like to convert it into a .csv table, but I'm not having much success.
Mack3 Line Item Journal Time 14:22:33 Date 03.10.2015
Panteni Ledger 1L TGEPIO00/CANTINAOAS Page 20.001
--------------------------------------------------------------------------------------------------------------------------------------------
| Pstng Date|Entry Date|DocumentNo|Itm|Doc..Date |BusA|PK|SG|Sl|Account |User Name |LCurr| Amount in LC|Tx|Assignment |S|
|------------------------------------------------------------------------------------------------------------------------------------------|
| 07.01.2014|07.02.2014|4919005298| 36|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 0,85 | |20140107 | |
| 07.01.2014|07.02.2014|4919065298| 29|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 2,53 | |20140107 | |
| 07.01.2014|07.02.2014|4919235298| 30|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 30,00 | |20140107 | |
| 07.01.2014|07.02.2014|4119005298| 32|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 1,00 | |20140107 | |
| 07.01.2014|07.02.2014|9019005298| 34|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 11,10 | |20140107 | |
|------------------------------------------------------------------------------------------------------------------------------------------|
The file in question is structured as a report from SAP. Practicing with Python and looking at other posts, I found this code:
with open('file.txt', 'rb') as f_input:
    for line in filter(lambda x: len(x) > 2 and x[0] == '|' and x[1].isalpha(), f_input):
        header = [cols.strip() for cols in next(csv.reader(StringIO(line), delimiter='|', skipinitialspace=True))][1:-1]
        break

with open('file.txt', 'rb') as f_input, open(str(ii + 1) + 'output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)
    for line in filter(lambda x: len(x) > 2 and x[0] == '|' and x[1] != '-' and not x[1].isalpha(), f_input):
        csv_input = csv.reader(StringIO(line), delimiter='|', skipinitialspace=True)
        csv_output.writerow(csv_input)
Unfortunately it does not work for my case: it creates empty .csv files and does not seem to read csv_input properly.
Any possible solution?
Your input file can be treated as CSV once we filter out a few lines, namely the ones that do not start with a pipe symbol '|' followed by a space ' ', which would leave us with this:
| Pstng Date|Entry Date|DocumentNo|Itm|Doc..Date |BusA|PK|SG|Sl|Account |User Name |LCurr| Amount in LC|Tx|Assignment |S|
| 07.01.2014|07.02.2014|4919005298| 36|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 0,85 | |20140107 | |
| 07.01.2014|07.02.2014|4919065298| 29|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 2,53 | |20140107 | |
| 07.01.2014|07.02.2014|4919235298| 30|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 30,00 | |20140107 | |
| 07.01.2014|07.02.2014|4119005298| 32|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 1,00 | |20140107 | |
| 07.01.2014|07.02.2014|9019005298| 34|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 11,10 | |20140107 | |
Your output is mainly empty because that x[1].isalpha() check is never true on this data. The character in position 1 on each line is always a space, never alphabetic.
It's not necessary to open the input file multiple times; we can read, filter, and write the output in one go:
import csv

ii = 0

with open('file.txt', 'r', encoding='utf8', newline='') as f_input, \
        open(str(ii + 1) + 'output.csv', 'w', encoding='utf8', newline='') as f_output:
    input_lines = filter(lambda x: len(x) > 2 and x[0] == '|' and x[1] == ' ', f_input)
    csv_input = csv.reader(input_lines, delimiter='|')
    csv_output = csv.writer(f_output)
    for row in csv_input:
        csv_output.writerow(col.strip() for col in row[1:-1])
Notes:
You should not use binary mode when reading text files. Use r and w modes, respectively, and explicitly declare the file encoding. Choose the encoding that is the right one for your files.
When working with the csv module, open files with newline='' (this lets the csv module handle line endings correctly).
You can open multiple files in a single with statement by using a \ at the end of the line.
StringIO is completely unnecessary here.
I'm not using skipinitialspace=True because some of the columns also have spaces at the end. Therefore I'm calling .strip() manually on each value when writing the row.
The [1:-1] is necessary to get rid of the superfluous empty columns (before the first and after the last | in the input)
The output is as follows:
Pstng Date,Entry Date,DocumentNo,Itm,Doc..Date,BusA,PK,SG,Sl,Account,User Name,LCurr,Amount in LC,Tx,Assignment,S
07.01.2014,07.02.2014,4919005298,36,07.01.2019,,81,,,60532640,tARFooWMOND,EUR,"0,85",,20140107,
07.01.2014,07.02.2014,4919065298,29,07.01.2019,,81,,,60532640,tARFooWMOND,EUR,"2,53",,20140107,
07.01.2014,07.02.2014,4919235298,30,07.01.2019,,81,,,60532640,tARFooWMOND,EUR,"30,00",,20140107,
07.01.2014,07.02.2014,4119005298,32,07.01.2019,,81,,,60532640,tARFooWMOND,EUR,"1,00",,20140107,
07.01.2014,07.02.2014,9019005298,34,07.01.2019,,81,,,60532640,tARFooWMOND,EUR,"11,10",,20140107,

Write pyspark dataframe to file keeping nested quotes, but not "outer" ones?

Is there a way to preserve nested quotes in a pyspark dataframe value when writing to a file (in my case, a TSV) while also getting rid of the "outer" ones (i.e. those that denote a string value in a column)?
>>> dff = sparkSession.createDataFrame([(10,'this is "a test"'), (14,''), (16,'')], ["age", "comments"])
>>> dff.show()
+---+----------------+
|age|        comments|
+---+----------------+
| 10|this is "a test"|
| 14|                |
| 16|                |
+---+----------------+
>>> dff.write\
        .mode('overwrite')\
        .option("sep", "\t")\
        .option("quoteAll", "false")\
        .option("emptyValue", "").option("nullValue", "")\
        .csv('/tmp/test')
then
$ cat /tmp/test/part-000*
10 "this is \"a test\""
14
16
# what I'd want to see is
10 this is "a test"
14
16
# because I am later parsing based only on TAB characters, so the quote sequences are not a problem in that regard
Is there any way to write the dataframe in this desired format?
* As an aside, more info about the args used can be found here.
Set the escapeQuotes option to false:
>>> dff = spark.createDataFrame([(10,'this is "a test"'), (14,''), (16,'')], ["age", "comments"])
>>> dff.show()
+---+----------------+
|age|        comments|
+---+----------------+
| 10|this is "a test"|
| 14|                |
| 16|                |
+---+----------------+
>>> dff.write\
...     .mode('overwrite')\
...     .option("sep", "\t")\
...     .option("quoteAll", "false")\
...     .option("emptyValue", "").option("nullValue", "")\
...     .option("escapeQuotes", "false").csv('/tmp/test')
>>>
➜ ~ cd /tmp/test
➜ test ls
_SUCCESS part-00001-f702e661-15c2-4ab9-aef2-8dad5d923412-c000.csv part-00003-f702e661-15c2-4ab9-aef2-8dad5d923412-c000.csv
part-00000-f702e661-15c2-4ab9-aef2-8dad5d923412-c000.csv part-00002-f702e661-15c2-4ab9-aef2-8dad5d923412-c000.csv
➜ test cat part*
10 this is "a test"
14
16
➜ test

How to create a data frame based on a date value passed as a string in pyspark?

I have a data set like the one below:
file: test.txt
149|898|20180405
135|379|20180428
135|381|20180406
31|898|20180429
31|245|20180430
135|398|20180422
31|448|20180420
31|338|20180421
I have created a data frame by executing the code below.
from pyspark.sql import SparkSession, SQLContext, Row

spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

df_transac = spark.createDataFrame(sc.textFile("test.txt")
                                   .map(lambda x: x.split("|")[:3])
                                   .map(lambda r: Row(cCode=r[0], pCode=r[1], mDate=r[2])))
df_transac.show()
+-----+-----+--------+
|cCode|pCode|   mDate|
+-----+-----+--------+
|  149|  898|20180405|
|  135|  379|20180428|
|  135|  381|20180406|
|   31|  898|20180429|
|   31|  245|20180430|
|  135|  398|20180422|
|   31|  448|20180420|
|   31|  338|20180421|
+-----+-----+--------+
My df.printSchema() output looks like this:
df_transac.printSchema()
root
|-- customerCode: string (nullable = true)
|-- productCode: string (nullable = true)
|-- quantity: string (nullable = true)
|-- date: string (nullable = true)
But I want to create a data frame based on my input dates, i.e. date1="20180425" and date2="20180501".
My expected output is:
+-----+-----+--------+
|cCode|pCode|   mDate|
+-----+-----+--------+
|  135|  379|20180428|
|   31|  898|20180429|
|   31|  245|20180430|
+-----+-----+--------+
Please help me with this: how can I achieve it?
Here is a simple filter applied to your df:
df_transac.where("mdate between '{}' and '{}'".format(date1,date2)).show()
+-----+-----+--------+
|cCode|pCode| mDate|
+-----+-----+--------+
| 135| 379|20180428|
| 31| 898|20180429|
| 31| 245|20180430|
+-----+-----+--------+
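
For completeness, the same filter can also be expressed with the DataFrame API instead of a SQL string. A minimal sketch, assuming the df_transac from the question and the two dates as plain Python strings:
from pyspark.sql import functions as F

date1, date2 = "20180425", "20180501"

# Column.between() is inclusive on both ends, just like SQL's BETWEEN,
# and string comparison on yyyyMMdd values sorts chronologically.
df_transac.where(F.col("mDate").between(date1, date2)).show()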
