I have an Apache Spark DataFrame df_result with one column, Name, which I export to CSV as follows:
df_result.repartition(1).write.option('header','true').option("delimiter","|").option("escape", "\\").mode("overwrite").csv(path)
In this column, the values are as below.
Name
.....
John
Mathew\M
In the second row, there is a \ character. When I export this to CSV using the script above, the value is written as Mathew\M in the file. Ideally, I need the value to be Mathew\\M in the file (i.e., each single \ should be replaced with \\). Is there a way to do this using an option, or any other way?
I am using Apache Spark 3.2.1.
Does this help? It seems to work for me (regexp_replace comes from pyspark.sql.functions):
df.withColumn('Name', regexp_replace('Name', r'\\', r'\\\\')).write.option('header','true').option("delimiter","|").option("escape", "\\").mode("overwrite").csv("/tmp/output/wk/bn.cv")
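For anyone puzzled by the doubled backslashes: the pattern r'\\' is a Java regex matching one literal backslash, and the replacement r'\\\\' emits two of them. A minimal self-contained check (the sample data here just mirrors the question) before doing the actual write could look like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace
spark = SparkSession.builder.getOrCreate()
# Tiny frame mirroring the Name column from the question
df = spark.createDataFrame([("John",), ("Mathew\\M",)], ["Name"])
# Double every literal backslash in the column before exporting
df.withColumn("Name", regexp_replace("Name", r"\\", r"\\\\")).show(truncate=False)
# Expected: John stays as-is, Mathew\M becomes Mathew\\M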
I have a PySpark dataframe that has commas in one of the fields.
Sample data:
+--------+------------------------------------------------------------------------------------+
|id |reason |
+--------+------------------------------------------------------------------------------------+
|123-8aab|Request for the added "Hello Ruth, How are, you, doing and Other" abc. Thanks! |
|125-5afs|Hi Prachi, I added an "XYZA", like josa.coam, row. "Uid to be eligible" for clarity.|
+--------+------------------------------------------------------------------------------------+
When I write this to CSV, the data spills over into the next column and is not represented correctly. This is the code I am using to write the data:
df_csv.repartition(1).write.format('csv').option("header", "true").save(
"s3://{}/report-csv".format(bucket_name), mode='overwrite')
How the data appears in the CSV:
Any help would really be appreciated. TIA.
NOTE: I think if the field has just commas, it exports properly; it is the combination of quotes and commas that causes the issue.
What worked for me was:
df_csv.repartition(1).write.format('csv') \
    .option("header", "true") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .save("s3://{}/report-csv".format(bucket_name), mode='overwrite')
More detailed explanation in this post:
Reading csv files with quoted fields containing embedded commas
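If it helps, here is a minimal round-trip sketch of the same idea (the /tmp path is just a placeholder): with both quote and escape set to the double-quote character, an embedded quote is written doubled inside a quoted field, which is the standard CSV convention, so the value stays in one column when read back with the same options.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Row with both embedded commas and embedded double quotes, like the sample above
df = spark.createDataFrame(
    [("123-8aab", 'Request for the added "Hello Ruth, How are, you, doing and Other" abc. Thanks!')],
    ["id", "reason"])
out_path = "/tmp/report-csv"  # placeholder path
df.repartition(1).write.format("csv") \
    .option("header", "true") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .mode("overwrite").save(out_path)
# Reading back with the same quote/escape options should keep reason in a single column
spark.read.option("header", "true").option("quote", "\"").option("escape", "\"").csv(out_path).show(truncate=False)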
I have a dataframe which I am trying to store into a database as below
oversampled_df.write \
.format('jdbc') \
.option('truncate', 'true') \
.options(url=EXT_DB_URL,
driver='oracle.jdbc.driver.OracleDriver',
dbtable=DEST_DB_TBL_NAME) \
.mode('overwrite') \
.save()
yet it keeps adding double quotes (") around the column names. How can I remove this so that I can query the table without including them, i.e.
instead of
select "description" from schema.table;
to be
select description from schema.table;
I faced the same problem too; my workaround was:
1) Create the table manually in Oracle:
CREATE TABLE schema_name.table_name(
table_catalog VARCHAR2(255 BYTE),
table_schema VARCHAR2(255 BYTE)
)
add option("truncate", "true")
oversampled_df.write.format('jdbc').options(
url='jdbc:oracle:thin:schema/user@ip:port/dbname',
driver='oracle.jdbc.driver.OracleDriver',
dbtable='schema_name.table_name',
user='user',
password='password').option("truncate", "true")\
.mode('overwrite').save()
This worked for me, hope it helps.
From the sounds of it, Oracle is treating your column names as quoted identifiers, so to query against them you need the double quotes and the names are also case sensitive. A workaround I found was making sure that all the columns in your DataFrame are capitalised (they can contain digits and underscores as well) before saving to Oracle, so it treats them as nonquoted identifiers. Then you should be able to query them in either lower or upper case, e.g. DESCRIPTION or description, without the need for double quotes.
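If it is useful, a minimal sketch of that rename-before-write idea, reusing the variable names from the question (untested against a real Oracle instance):
# Upper-case every column name so Oracle treats them as nonquoted identifiers
upper_df = oversampled_df.toDF(*[c.upper() for c in oversampled_df.columns])
upper_df.write \
    .format('jdbc') \
    .option('truncate', 'true') \
    .options(url=EXT_DB_URL,
             driver='oracle.jdbc.driver.OracleDriver',
             dbtable=DEST_DB_TBL_NAME) \
    .mode('overwrite') \
    .save()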
I am trying to find and segregate rows where certain columns don't follow a certain pattern. I found the following example in the Databricks documentation, which identifies and checks whether column values are integers and writes the bad records into a JSON file.
I want to identify whether a column's values look like 1,245.00; bad records would look like 1.245,00.
The values can have a varying number of digits, and I just want to check whether the data follows a pattern like 1,245.00 in PySpark.
Sometimes in the raw data, commas and dots are interchanged.
Can someone tell me how to collect such records in badRecordsPath, as in the following example?
// Creates a json file containing both parsable and corrupted records
Seq("""{"a": 1, "b": 2}""", """{bad-record""").toDF().write.text("/tmp/input/jsonFile")
val df = spark.read
.option("badRecordsPath", "/tmp/badRecordsPath")
.schema("a int, b int")
.json("/tmp/input/jsonFile")
df.show()
The above example is in Scala, and I am looking for a PySpark solution if possible. Thanks.
Please find some examples below (the final two digits are the decimal part):
1,245.00
3,5000.80
6.700,00
5.7364732,20
4,500,600.00
A dataframe with the following (compliant) data should have a dot followed by two decimal digits:
1,245.00
3,5000.80
4,500,600.00
Illegal data points (a comma before the decimal digits) should be kept in badRecordsPath:
6.700,00
5.7364732,20
Thanks
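As far as I know, badRecordsPath only captures records that fail parsing or schema conversion, so value-level pattern violations like these have to be separated explicitly. A minimal PySpark sketch of one possible approach (df and the amount column name are hypothetical, and the regex is only an assumption based on the examples above):
from pyspark.sql.functions import col
# Assumed rule from the examples: digits and comma separators, then a dot and exactly two decimal digits
pattern = r"^[0-9,]+\.\d{2}$"
good_df = df.filter(col("amount").rlike(pattern))   # compliant rows, written to the normal output
bad_df = df.filter(~col("amount").rlike(pattern))   # non-compliant rows
# Write the non-compliant rows to the "bad records" location yourself
bad_df.write.mode("overwrite").json("/tmp/badRecordsPath")  # placeholder path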
I have ingested data into HDFS using Sqoop; however, my data contains commas (',') within single columns. When I use the same data in Spark, it treats each comma as a separator. What can I do about these commas?
For example, if a column xyz contains a,b,c in the first line and cd in the second line, what can I do to stop these commas being treated as separators?
When importing data in text format, the default field delimiter is a comma (,). Since your data contains commas, change the field delimiter.
Use --fields-terminated-by <char> in your sqoop import command.
You might find these commands useful:
--hive-drop-import-delims or --hive-delims-replacement
More info here
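Once the Sqoop import uses a non-comma delimiter (say |, purely as an example), the Spark side only needs to read with that same separator; a minimal PySpark sketch (the HDFS path is a placeholder):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read the Sqoop output using the delimiter chosen with --fields-terminated-by
df = spark.read.option("sep", "|").csv("/path/to/sqoop/output")
df.show(truncate=False)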
We are having the below problem with data import using the COPY utility.
The COPY command isn't allowing special symbols like € and ¥ while loading data, and it fails to insert that data into the table from the CSV.
Do we have any extra parameters for the COPY command that will allow the special symbols?
sample.csv
id symbols
68f963d0-55a3-4453-9897-51c10c099499 $
41f172d6-0939-410a-bcde-5bcf96509710 €
50f925e7-c840-485c-bec0-23711c79ea11 ¥
c3ccc350-5734-42c4-a07d-9647f72c4236 $
"currency" is the table name, with id as the partition key.
COPY currency from 'sample.csv' with header=true;
-> This loads only the records with the $ symbol but skips the 2nd and 3rd records, where the symbols are € and ¥.
How to keep a comma (,) inside a column and insert the complete string using the COPY command?
I have the following values for a column (COMMENTS) in a CSV and want to insert them into a Cassandra table. In this case, the comma is treated as the start of the next column's value, since the CSV is comma separated. Is there any feature/parameter that allows a comma (,) within a single column when using the COPY command?
Tried with
COPY TABLENAME from 'test.csv' with header=true and QUOTE='"';
But I couldn't load the column data below, which has commas within a single column.
Ex:
COMMENTScolumn
"Exact", "and $210/mt MLR China at most"
Since it’s a mill as indi,bid
"Range, and $430/mt FOB India at most"
I couldn't find any suitable parameter in the COPY command utility.
Versions
Cassandra 2.1.13
cqlsh 5.0.1
1) Support for special characters
I am using cqlsh 5.0.1 with Cassandra 3.0.2 and I am able to load € and other symbols from a CSV file into Cassandra using cqlsh. Can you also specify your cqlsh version so I can check with that one?
2) Support of comma inside a String
I would suggest using a different delimiter with the COPY TO command and then using the same delimiter for the COPY FROM command.
Ex:
COPY global to 'global.csv' WITH DELIMITER='|';
COPY global from 'global.csv' WITH DELIMITER='|';
This way, the commas inside your strings are no longer treated as field delimiters.