Spark version=Apache Spark 3.3.0
I'm using Spark Structured Streaming to read and process comma-delimited CSV files. However, some records silently get dropped from the target table; I assume this happens because a column value contains a very large amount of text.
Edit: the read DataFrame contains the data correctly, but when the DataFrame is written, a few rows get removed.
The following comma-delimited row is one that gets dropped automatically, and there are more like it (values redacted with X's).
a500o0000008bugAAA,FALSE,KMI000004704,Key Medical Insight,0050o00000WuoSBAAZ,2020-04-02T10:17:02.000Z,0019000000R3GVDAA3,4/2/2020,"<XXXX XXXXX=XXXXX-XXXXX: XXX-XXXX;>XXXXXXXXXX XXXX XXXXXXX XXXX XX XXXXXXXXXXX XXXXXXXX XXXXXX XXXXX-XX. XXXXXX XXX XXX XXXXXXXXX XXX XXXXX XXXXXXXXXX XXXX XXXXXXX XXXXXXXXXXX XXX XXXX XXXXXXXX XX XXXXXXXXXX XXXXX XXXXXXXXXXXX XXXXX XX XXXXX XXXXX. XXXXXXXX XXXX XXXX XX XXXXX XXXXX XXXXXXXXXX XXXXXX XXXXX-XX XXXXX XXXXXXXX XXXXXX XXXXXXXXXX XXX XXXXXXX XXXX XXXXXXX XXXXXX XX XXXX XXXXXX XXXXXXXXXX XXX XXXXXXX XXXXX XXX. </XXXX><XXX><XXXX XXXXX=XXXXX-XXXXX: XXX-XXXX;><XX></XXXX></XXX><XXX><XXXX XXXXX=XXXXX-XXXXX: XXX-XXXX;>XXXXXXXXX XXXX XXXXXXX&#XX;X XX.X XXXXX XXXXXXXXX (XXX-XXXXXXXX) XXXXX XXX XXXX XXXXXXXX XXXX X XXXXXXXX - XXXX XXXXXXXX XX XXXXXX XXX (XXXXXX XXXX XXXXX XXX XXXXX) XX XXXX-XX-XXXX XXXXX XXX XXXXXXX XXXXXXX XXXXXXXXX XXX XXXXXXXXXX (XXXX XXX XXXXXXXX XXXXX).
XXXXXXX XXXXXXXXX XXXX XXXXXXXX XX XXXXXX XXX XXXXXX XXXX XXXX XX XXXXX XXXXXXX X.X. &XXXX;XXXXXXXX XXXXX XXXXXXXX XXXXX XXXXXXXXXX XXXX XXXX XXXX XXXXXXX XXXXXXXXX&XXXX;. </XXXX><XXX><XXXX XXXXX=XXXXX-XXXXX: XXX-XXXX;><XX></XXXX></XXX><XXX><XXXX XXXXX=XXXXX-XXXXX: XXX-XXXX;>XXXXX XXXXXXXXXX: XXXXXXX XXXXXXX XX XX XXXXX XXXXXXXXXXX XXXXX XXXXXXXXXX XX XXXX XXXXXXXXXX XX XXXXXXXXXX XXXXXXXX XXXXX XXX XXXXXXX XX XXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XX XXXX XXXXXXXXXX XXXXX XX XXXXXXXX XXXXXXXX XXXX XXX XX X XXXXX XXXX XXXX XXX XXXXXX XXXXXXXXXX XX XXXXXXXXXX XXXXX XXXXXXXXX. XX XXXXXXXX XXXX XXXXXXXX XXXXXXXXXXXX XXXXXXXXX XXX XXXXX XXXXXXXXX XXXXXX XXXXX-XX XXXXXX XXXXXXX XX XX XXXXX XXXX XXXXXXXXX XXXXXXXX XXX XXXXXXXXXX/XXXXXXX - XXXXXX XX XXX.</XXXX></XXX></XXX>",a040o00002QZ26mAAD,Submitted_vod,Local Data/ Fact/Observation,Key Opinion Leader,Hematology
Note: the field with the "XXXX" values is a single quoted field containing many spaces, special characters, and line breaks.
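In other words, the source files contain records shaped like this simplified, made-up example, where one quoted field spans several physical lines:
id,flag,comment
1,FALSE,"first line of the comment
second line of the same comment"
2,TRUE,"a normal single-line comment"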
I read the CSV files with the following code:
def read_stream(container_read_path, file_format, delimiter, spark, header):
    # Let Structured Streaming infer the schema of the incoming files.
    spark.conf.set("spark.sql.streaming.schemaInference", True)
    source_data = (
        spark.readStream.format(file_format)
        .option("header", header)
        .option("sep", delimiter)
        .option("escape", "\"")
        .option("multiline", True)
        .option("recursiveFileLookup", "true")
        .load(f"{container_read_path}")
    )
    return source_data
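For reference, a call to this helper might look like the following; the mount path and argument values are placeholders, not taken from the original job:
# Hypothetical invocation of read_stream; adjust path, format, and delimiter to your setup.
source_df = read_stream("/mnt/raw/landing", "csv", ",", spark, "true")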
The following code is used to save the structured streaming data:
def write_stream(dataframe, database_name, table_name, checkpoint_path, partition_cols,
                 header, file_format='parquet'):
    # trigger(once=True) processes all available data and stops; each micro-batch
    # is persisted by write_raw_file via foreachBatch.
    (dataframe
        .writeStream
        .format(file_format)
        .trigger(once=True)
        .option("checkpointLocation", f'{checkpoint_path}')
        .foreachBatch(lambda df, epochId: write_raw_file(df, epochId, database_name,
                                                         table_name, partition_cols,
                                                         header, file_format))
        .start()
    )
def write_raw_file(df, epochId, database_name, table_name, partition_cols, header,
                   file_format):
    # Text input is persisted as CSV, and only CSV output gets a header row.
    file_format = 'csv' if file_format == 'text' else file_format
    header = "true" if file_format == 'csv' else "false"
    (df.write
        .mode("append")
        .option("header", header)
        .partitionBy(partition_cols)
        .format(file_format)
        .saveAsTable(f"{database_name}.{table_name}")
    )
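The streaming write could then be started like this; the database, table, checkpoint path, and partition column are hypothetical examples:
# Hypothetical invocation of write_stream with made-up names.
write_stream(
    source_df,
    database_name="raw_db",
    table_name="key_medical_insights",
    checkpoint_path="/mnt/checkpoints/key_medical_insights",
    partition_cols=["load_date"],
    header="true",
    file_format="parquet",
)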
I resolved the issue by also adding option("multiLine", True) to the write DataFrame:
def write_raw_file(df, epochId, database_name, table_name, partition_cols,
                   header, file_format):
    file_format = 'csv' if file_format == 'text' else file_format
    header = "true" if file_format == 'csv' else "false"
    (df.write
        .mode("append")
        .option("header", header)
        .option("multiline", True)
        .partitionBy(partition_cols)
        .format(file_format)
        .saveAsTable(f"{database_name}.{table_name}")
    )
What is happening is that the column with the "XXXX" value contains many spaces, special characters, and line breaks, and Spark ignores rows that span multiple lines. I had to add option("multiLine", True) to both the read DataFrame and the write DataFrame to get the desired result.
The Spark CSV documentation does not mention this, so it seems worth highlighting that option("multiLine", True) can be used on the write DataFrame as well.
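To see the read-side effect of multiLine in isolation, here is a minimal, non-streaming sketch; the file path is hypothetical and the counts will differ depending on your data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/tmp/sample_multiline.csv"  # hypothetical sample with quoted, multi-line fields

# Without multiLine, every physical line is treated as a separate record,
# so rows whose quoted field contains newlines get split or dropped.
plain = spark.read.option("header", "true").csv(path)

# With multiLine plus an explicit escape character, the quoted field is
# read as one value and the row count matches the source file.
multi = (spark.read
         .option("header", "true")
         .option("multiLine", "true")
         .option("escape", "\"")
         .csv(path))

print(plain.count(), multi.count())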
Related
Pandas Dataframe
I would like to rearrange the rows of the dataframe shown below.
The date is a datetime index.
date ticker return
20050228 RIY xxxx
20050228 A xxxx
20050228 AAN xxxx
20050228 AAP xxxx
20050228 AAPL xxxx
.... ... ...
20050331 RIY xxxx
20050331 A xxxx
20050331 AAN xxxx
20050331 AAP xxxx
.... ... ...
20171231 RIY xxxx
20171231 A xxxx
20171231 AAN xxxx
20171231 AAP xxxx
The above is a pandas dataframe of monthly stock returns.
But I would like to get an output that looks like below:
date ticker return
20050228 RIY xxxx
20050331 RIY xxxx
20050430 RIY xxxx
.... ... ...
20171231 RIY xxxx
20050228 A xxxx
20050331 A xxxx
20050430 A xxxx
.... ... ...
20171231 A xxxx
.... ... ...
How can I write the code to get the above output?
df.reset_index().sort_values(['date', 'ticker'], ascending=[True, False]).set_index('date')
It seems like you want to sort by your index and the ticker column. Try the above code, which was adapted from another Stack Overflow question: "Sorting a table values based on index and non-indexed columns using python".
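If the goal is specifically the layout shown in the question, with all rows for one ticker grouped together and dates ascending within each group, sorting by ticker first may be closer to what you want. This is only a sketch using the column names from the question, and it orders tickers alphabetically, so RIY would not appear first:
# df has a datetime index named 'date' and columns 'ticker' and 'return'.
out = (df.reset_index()
         .sort_values(['ticker', 'date'], ascending=[True, True])
         .set_index('date'))
print(out.head())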
I have C macros like:
#define MY_MACRO() \
xxxxx xxxxxxx \
xxxxxx xxxxxxx \
xx xxxxxxxxxx xxxxxx x xxxxxxx xxxxxxxx \
xxxxx \
xxxxxx xxxxxxx \
xxxxxx x xxxxxxxxx xx xxxxxxxxx x \
xxxxxxxxxx xxxxxxxxxxxxxx \
x xxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxx \
xxxxxx x xxxxxx x xxxxxxxx x xxxxxx x xxxxxxxxx \
x \
xxx x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
xxxxxxxxxxxxxxxxxxxxxxxxx xxx \
xxxxx \
xxxxxx xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx x \
x xx xx xxxxxxx
And I'm trying to pad them to:
#define MY_MACRO() \
xxxxx xxxxxxx \
xxxxxx xxxxxxx \
xx xxxxxxxxxx xxxxxx x xxxxxxx xxxxxxxx \
xxxxx \
xxxxxx xxxxxxx \
xxxxxx x xxxxxxxxx xx xxxxxxxxx x \
xxxxxxxxxx xxxxxxxxxxxxxx \
x xxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxx \
xxxxxx x xxxxxx x xxxxxxxx x xxxxxx x xxxxxxxxx \
x \
xxx x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
xxxxxxxxxxxxxxxxxxxxxxxxx xxx \
xxxxx \
xxxxxx xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx x \
x xx xx xxxxxxx
The lines should be 80 chars total.
'<,'>s/\(.*[^\s]\)\s*\\\s*$/\=printf('%-79s\', submatch(1))
mostly does it, except for lines such as:
xxxxxx xxxxxxx \
that exceed the 80-char limit to start with.
What am I doing wrong?
The printf() width specifier only adds padding to widen the string; if the string is already longer than the given width, nothing gets cut off.
Your regular expression attempts to limit the match by excluding trailing whitespace from the capture group, but [^\s] does not do that: you cannot use atoms like \s or \w inside a collection. Either use the corresponding character class ([^[:space:]]) or, since it is available here, the negated atom \S. With this fix, your substitution works just fine:
'<,'>s/\(.*\S\)\s*\\\s*$/\=printf('%-79s\', submatch(1))
For example, a short line such as "xxxxxx xxxxxxx \" now gets padded with spaces so that its trailing backslash ends up in column 80.
I have a text as following:
Jul 11 xxxx xxxx start xxxxx
....
....
....
Jul 11 xxxx xxxx stop xxxxx
Jul 11 xxxx xxxx start xxxxx
....
....
....
Jul 11 xxxx xxxx stop xxxxx
....
now I want to split the above text file into different files based on "start" and "stop", like
/***text1.txt******/
Jul 11 xxxx xxxx start xxxxx
....
....
....
Jul 11 xxxx xxxx stop xxxxx
/***text2.txt******/
Jul 11 xxxx xxxx start xxxxx
....
....
....
Jul 11 xxxx xxxx stop xxxxx
How can I do that? Thanks.
This should do it:
$ awk '{if ($0 ~ /start/) a++} {print >> "file"a}' file
Explanation
{if ($0 ~ /start/) a++} looks for lines containing the word start. When one is found, it increments the variable a, which starts out at 0.
{print >> "file"} would print $0 (that is, the whole line) to a file called "file" in the current directory.
{print >> "file"a} prints the line to a file called "file" plus the value of a. Since a becomes 1 at the first start line, 2 at the second, and so on, the output goes to file1, file2, ...
Test
$ cat a
Jul 11 xxxx xxxx start xxxxx
....
....
....
Jul 11 xxxx xxxx stop xxxxx
Jul 11 xxxx xxxx start xxxxx
here begins 2nd file
....
....
....
Jul 11 xxxx xxxx stop xxxxx
$ awk '{if ($0 ~ /start/) {a++}} {print >> "file"a}' a
$ cat file1
Jul 11 xxxx xxxx start xxxxx
....
....
....
Jul 11 xxxx xxxx stop xxxxx
$ cat file2
Jul 11 xxxx xxxx start xxxxx
here begins 2nd file
....
....
....
Jul 11 xxxx xxxx stop xxxxx
With awk, the 'if ($0 ~ /pattern/)' test is implicit when the pattern is written in front of the action block:
awk 'BEGIN{a=0;b=0} /start/{a++;b=1} (b==1){print >> "file"a} /stop/{b=0}' input_file.txt
I have been searching everywhere for an online tool doing a simple but crucial thing:
Truncating each line in a document after x number of characters (output: each line having a maximum length of xx characters).
I would be very happy if someone could give me the URL for such tool!!
Example:
Document has 3 lines
It should truncate each line after 20 characters
INPUT:
Xxx xxx xxxxxx. Xxxxxx xxxxx, xxx xxx xxxx.
Xx xxx xxx xxxx xxxxxx xx.
Xxxxxxx.
OUTPUT:
Xxx xxx xxxxxx. Xxxx
Xx xxx xxx xxxx xxxx
Xxxxxxx.
What I want to find is: 1) an online tool where I paste the original text in one box, 2) enter the desired maximum number of characters per line, 3) click a button, and 4) find the result in another box (or the same box).
Thanks for any help!
In OS X Terminal:
If you have the input data in a file at /path/to/file.txt:
$ cut -c 1-20 /path/to/file.txt
Xxx xxx xxxxxx. Xxxx
Xx xxx xxx xxxx xxxx
Xxxxxxx.
If you wish to enter the data interactively:
$ cut -c 1-20 <<EOF
> Xxx xxx xxxxxx. Xxxxxx xxxxx, xxx xxx xxxx.
> Xx xxx xxx xxxx xxxxxx xx.
> Xxxxxxx.
> EOF
Xxx xxx xxxxxx. Xxxx
Xx xxx xxx xxxx xxxx
Xxxxxxx.
I'd like to traverse a text file and pull out groups of lines at a time.
In the example below, I'd like to grep all the lines below AAA but stop at bbb (i.e. all of the 'xxx' lines).
Thanks
example:
-------AAA-------
xxx
xxx
xxx
xxx
xxx
-------bbb--------
yyy
yyy
yyy
yyy
------AAA---------
xxx
xxx
xxx
xxx
------bbb--------
yyy
If you don't mind the AAA and bbb lines being included, this should suffice for your example:
$ awk '/AAA/,/bbb/' file
If you don't want the AAA and bbb lines:
$ awk '/bbb/{f=0}/AAA/{f=1;next}f{print}' file
Alternatively, if you have Ruby (1.9+):
$ ruby -0777 -ne 'puts $_.scan(/-+AAA-+(.*?)-+bbb-+/m) ' file