Trouble when using data in Spark?

I have ingested data into HDFS using Sqoop, but my data contains commas (',') inside single columns. When I use the same data in Spark, it treats each comma as a separator. What can I do about these commas?
For example, suppose a column xyz contains a,b,c on the first line and cd on the second line; what can I do to stop those commas being treated as separators?

When importing data in text format, the default field delimiter is the comma (,). Since your data contains commas, change the field delimiter.
Use --fields-terminated-by <char> in your sqoop import command, picking a character that never occurs in your data. A Spark-side sketch follows.
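Once the import uses a different delimiter, tell Spark about it explicitly when reading. A minimal sketch, assuming the Sqoop import ran with --fields-terminated-by '|' and a hypothetical HDFS path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Sqoop output with the pipe delimiter instead of the default comma,
# so commas inside the xyz column stay inside one field.
df = spark.read.option("delimiter", "|").csv("hdfs:///user/data/mytable")
df.show()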

You might find these commands useful:
--hive-drop-import-delims or --hive-delims-replacement
More details are in the Sqoop user guide.

Related

Replace single backslash with double backslash

I have the Apache Spark DataFrame df_result below, which has one column, Name. I export it with:
df_result.repartition(1).write.option('header','true').option("delimiter","|").option("escape", "\\").mode("overwrite").csv(path)
The values in this column look like this:
Name
.....
John
Mathew\M
In the second row, there is a \ character. When I export this to CSV using the above script, it writes the value as Mathew\M in the file. Ideally, I need the value as Mathew\\M in the file (i.e., each single \ should be replaced with \\). Is there a way to do this using an option, or some other way?
I am using Apache Spark 3.2.1.
Does this help? It seems to work for me. Note that regexp_replace goes through Java's regex engine, so a literal backslash has to be escaped in both the pattern and the replacement (the raw strings below do that):
from pyspark.sql.functions import regexp_replace

df.withColumn('Name', regexp_replace('Name', r'\\', r'\\\\')) \
    .write.option('header', 'true').option("delimiter", "|") \
    .option("escape", "\\").mode("overwrite").csv("/tmp/output/wk/bn.cv")
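A minimal check of just the replacement step (in-memory sample data, not the asker's df_result):

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.getOrCreate()

# "Mathew\\M" in Python source is the single-backslash value Mathew\M.
df = spark.createDataFrame([("John",), ("Mathew\\M",)], ["Name"])
df.withColumn("Name", regexp_replace("Name", r"\\", r"\\\\")).show()
# +---------+
# |     Name|
# +---------+
# |     John|
# |Mathew\\M|
# +---------+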

Want to Ignore Comma(,) in ADF as delimiter

I have a tab-delimited text file which I'm converting into CSV in ADF.
Some fields contain comma-separated values (which I don't want split into columns) -> I want each treated as a single column.
But ADF is trying to separate those values into multiple columns.
Although I have set the delimiter to tab (\t), ADF is still treating both comma and tab as delimiters.
Why?
In my example, I want to split on the '\t' delimiter, but ADF considers both tab and comma as delimiters (I want [pp,aa] treated as a single value/column).
Any solution? Please let me know.
Thank you.
I'm attaching the ADF configuration as well.
Set the quote character to a single character instead of "No quote character" in source and sink datasets.
I tried with the sample and was able to get the data as expected.
(Screenshots of the source data, source dataset, sink dataset, and output omitted.)
Reference - Dataset Properties
Property   | Description
quoteChar  | The single character used to quote a column value when it contains the column delimiter. The default is the double quote (").
escapeChar | The single character used to escape quotes inside a quoted value. The default is the backslash (\).
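The same quoting rule is easy to see outside ADF. A minimal sketch with Python's csv module (an analogy, not ADF itself): a value containing the delimiter is wrapped in the quote character on write and comes back as a single column on read:

import csv
import io

# Writing: the value containing the delimiter gets wrapped in the quote char.
buf = io.StringIO()
csv.writer(buf, delimiter=",", quotechar='"').writerow(["pp,aa", "bb"])
print(buf.getvalue())  # "pp,aa",bb

# Reading with the same quote char keeps pp,aa as one column.
print(next(csv.reader(io.StringIO(buf.getvalue()), quotechar='"')))  # ['pp,aa', 'bb']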

How to ingest comma separated csv with unquoted commas in some columns in ADF

Hi everyone. I have not seen this particular issue come up; I've seen a few related ones, but none address this.
I have very big CSVs (up to 8 GB) with comma as the delimiter, free text in some columns, and commas inside some of that free text.
As requirements: I cannot regenerate, or ask for, the CSVs with another delimiter, and I have to use Data Flow to achieve this.
I would like to learn how to deal with text such as:
A, some text 2132,ALL, free text 00001,2020-11-29 - 2020-12-05
A, some text 2132,ALL, free text\,more text 0002,2018-12-09 - 2018-12-15
A, some text 2132,ALL, free text\,more text 00003,2018-12-09 - 2018-12-15
Things I have tried:
I tried making both simple Data Flows and Copy Activities to see whether the parser handled this properly. It did not, no matter what combination of CSV dataset configuration I tried.
I tried reading the whole CSV as one column and writing it to a file with the "," regexed out. This has the issue of "losing" the commas from the CSV, leaving spaces as the delimiter, so I am back to square one without a proper delimiter, since the text itself contains spaces and would break.
Actually, Data Factory can't deal with a CSV file in which column data contains the same character as the column delimiter. It causes the schema/columns to go missing.
Even with Data Flow, Data Factory will always infer the schema from the first row according to the number of delimiters.
Since you said you can't change the source CSV file, I'm afraid we can't achieve this in Data Factory.
What I did to make this work (I did it twice with different results, so I am still missing something, but one of the attempts worked):
Create a dataset with no delimiter, so the whole CSV row is read as one column. Use the Data Flow replace function there to make the problematic string disappear. Write to disk as CSV.
Then read it as CSV with the proper delimiter, do whatever the data needs, and write to disk as Parquet. That appears to work. A sketch of the same two-pass idea in PySpark follows.
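A sketch of the two-pass approach in PySpark rather than ADF (paths and the <COMMA> placeholder are hypothetical; the escaped comma '\,' from the sample rows is what gets neutralized):

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.getOrCreate()

# Pass 1: read each line whole as a single string column, swap the escaped
# comma for a placeholder, and write the cleaned lines back out.
raw = spark.read.text("/data/in.csv")
cleaned = raw.withColumn("value", regexp_replace("value", r"\\,", "<COMMA>"))
cleaned.write.mode("overwrite").text("/data/tmp")

# Pass 2: now every remaining comma really is a delimiter.
df = spark.read.option("delimiter", ",").csv("/data/tmp")
df = df.select([regexp_replace(c, "<COMMA>", ",").alias(c) for c in df.columns])
df.write.mode("overwrite").parquet("/data/out")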

Copy Activity missing column in the output

I have a Copy Activity which takes the output of a procedure and writes it to a temp CSV file. I needed the headers in double quotation marks, so after that I have a Data Flow task that takes the temp file and adds the quotes via the sink settings. Yet the output is not what is expected: the last column is missing in some of the records due to commas in the data.
Is there a way to use only a Copy Activity but still have the column names in double quotes?
When we set the column delimiter, Data Factory infers the schema from the first row according to the number of delimiters. If your data contains a value with the same character as the column delimiter, you will lose some columns.
For now, Data Factory can't solve this. The only workaround is to use a different column delimiter, for example '|':
(Output example screenshot omitted.)
We also can't get the header wrapped in double quotes in the output .csv file; that isn't supported in Data Factory. A post-processing sketch outside Data Factory follows.
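If a post-processing step outside Data Factory is acceptable, here is a minimal Python sketch (file path hypothetical, '|' delimiter matching the suggestion above) that wraps only the header row in double quotes after the Copy Activity has written the file:

# Quote each header name; data rows are left untouched.
path = "/data/out.csv"  # hypothetical output of the Copy Activity
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
header = "|".join(f'"{name}"' for name in lines[0].split("|"))
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join([header] + lines[1:]) + "\n")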
HTH.

Does the COPY utility in Cassandra have parameters that allow special symbols and comma-separated strings in a single column?

We are having the problems below with data import using the COPY utility.
The COPY command isn't accepting special symbols like € and ¥ while loading data, so it fails to insert those rows into the table from the CSV.
Are there any extra parameters for the COPY command that will allow the special symbols?
sample.csv
id symbols
68f963d0-55a3-4453-9897-51c10c099499 $
41f172d6-0939-410a-bcde-5bcf96509710 €
50f925e7-c840-485c-bec0-23711c79ea11 ¥
c3ccc350-5734-42c4-a07d-9647f72c4236 $
"currency" is the table name with id as partion key.
COPY currency from 'sample.csv' with header=true;
-> This load only records with $ symbol but skips 2nd and 3rd records where symbols are €, ¥
How can we keep a comma (,) inside a column and insert the complete string using the COPY command?
I have the following values for a column (COMMENTS) in the CSV and want to insert them into a Cassandra table. Here the comma is treated as the start of the next column's value, since the CSV is comma-separated. Is there any feature/parameter that allows a comma inside a single column when using the COPY command?
Tried with:
COPY TABLENAME FROM 'test.csv' WITH HEADER=true AND QUOTE='"';
But this couldn't load the column data below, where the values contain commas within a single column.
Ex:
COMMENTS column
"Exact", "and $210/mt MLR China at most"
Since it’s a mill as indi,bid
"Range, and $430/mt FOB India at most"
I couldn't find any relevant parameters for the COPY command.
Versions
Cassandra 2.1.13
cqlsh 5.0.1
1) Support for special characters
I am using cqlsh 5.0.1 with Cassandra 3.0.2 and I am able to load € and other symbols from a CSV file into Cassandra using cqlsh; since you're on Cassandra 2.1.13, the difference may be version-related.
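One thing worth checking before anything else: whether the CSV is actually UTF-8 encoded. € and ¥ are exactly the characters that get lost when the file was saved in a single-byte encoding. A minimal re-encoding sketch in Python, assuming (hypothetically) the file is cp1252:

# Re-encode the CSV to UTF-8; the cp1252 source encoding is an assumption.
with open("sample.csv", encoding="cp1252") as src:
    data = src.read()
with open("sample_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(data)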
2) Support for commas inside a string
I suggest using a different delimiter with COPY TO, and then the same delimiter with COPY FROM.
Ex:
COPY global TO 'global.csv' WITH DELIMITER='|';
COPY global FROM 'global.csv' WITH DELIMITER='|';
This way the commas inside your strings survive.
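If the file can't be regenerated with COPY TO (for example, it came from an external system), the same delimiter swap can be applied to the CSV itself. A minimal sketch with Python's csv module, file names hypothetical:

import csv

# Re-read the comma-delimited file (quoted fields such as "Range, and ..."
# are handled by the csv module) and write it back pipe-delimited.
with open("test.csv", newline="") as src, open("test_pipe.csv", "w", newline="") as dst:
    csv.writer(dst, delimiter="|").writerows(csv.reader(src))

# Then, in cqlsh:
#   COPY TABLENAME FROM 'test_pipe.csv' WITH HEADER=true AND DELIMITER='|';

Note the limitation: an unquoted row like "Since it’s a mill as indi,bid" still splits on its comma, because nothing marks that comma as data; only quoted fields survive intact.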
