Find incorrect words in a dataframe in Spark Scala - apache-spark

I am learning Scala/Spark. As practice, I am trying to find incorrect words in a dataframe. Below is the input dataframe, which has no header and is pipe ('|') delimited.
1|I am hair for spark practss
2|I thank I am doing vary badd
3|But stay tuned whoo knooow
4|That's foor now
I have another dataframe that lists incorrect and correct words (the first column is the incorrect word and the second is the correct word). How can I find the list of incorrect words for each line of the input dataframe? Below is a sample of the second dataframe, which also has no header and is pipe delimited.
hair|here
thank|think
whoo|who
knooow|know
foor|for
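The lookup logic can be sketched in plain Python before porting it to Spark. This is a minimal sketch, not a Spark solution: the names `parse_lookup` and `find_incorrect` are hypothetical, and it assumes words are separated by whitespace and matched case-sensitively. Only words present in the lookup are reported, so misspellings such as "practss" that the lookup does not list are not flagged.

```python
# Plain-Python sketch of the per-line lookup (assumed helper names, not Spark API).
input_lines = [
    "1|I am hair for spark practss",
    "2|I thank I am doing vary badd",
    "3|But stay tuned whoo knooow",
    "4|That's foor now",
]
lookup_lines = ["hair|here", "thank|think", "whoo|who", "knooow|know", "foor|for"]


def parse_lookup(lines):
    """Build {incorrect_word: correct_word} from 'incorrect|correct' lines."""
    return dict(line.split("|", 1) for line in lines)


def find_incorrect(lines, lookup):
    """Return {line_id: [incorrect words found, in order of appearance]}."""
    result = {}
    for line in lines:
        line_id, text = line.split("|", 1)
        result[line_id] = [word for word in text.split() if word in lookup]
    return result


lookup = parse_lookup(lookup_lines)
incorrect_by_line = find_incorrect(input_lines, lookup)
for line_id, words in incorrect_by_line.items():
    print(line_id, words)
```

In Spark the same idea would likely be expressed by splitting each line into words, exploding them into rows, and joining against the lookup dataframe on the incorrect-word column.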

Related

Pandas read_csv - dealing with columns that have ID numbers with consecutive '$' and '#' symbols, along with letters and digits

I'm trying to read a csv file with a column of data that has a scrambled ID number that includes the occasional consecutive $$ along with #, numbers, and letters.
SCRAMBLE_ID
AL9LLL677
AL9$AM657
$L9$$4440
#L9$306A1
etc.
I tried the following:
df = pd.read_csv('MASTER~1.CSV', dtype={'SCRAMBLE_ID': str})
which rendered the third entry as L9$4440 (L9 appears in a serif font, italicized, and the first and second $ vanish).
Faced with an entire column of ID numbers configured in this manner, what is the best way of dealing with such data? I can imagine:
PRIOR TO pd.read_csv: replacing the offending symbols with substitutes that don't create this problem (and what would those be), OR,
is there a way of preserving the IDs as is but making them into a data type that ignores these symbols while keeping them present?
Thank you. I've attached a screenshot of the .csv side by side with resulting df (Jupyter notebook) below.
csv column to pandas df with $$
I cannot replicate this using the same values as you in a mock CSV file.
Are you sure that the formatting based on the $ symbol is not occurring wherever you are rendering your dataframe values? Have you checked whether the data in the dataframe is what you expect, or are you only looking at the rendered output?
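One way to act on this suggestion is to look at the raw string values rather than the rendered table. The sketch below uses Python's standard `csv` module on an in-memory copy of the sample values (the real file is not available here, so `sample_csv` is a stand-in), and prints each value with `repr` so that no renderer can reinterpret the `$` characters.

```python
import csv
import io

# In-memory stand-in for the CSV file, using the sample IDs from the question.
sample_csv = "SCRAMBLE_ID\nAL9LLL677\nAL9$AM657\n$L9$$4440\n#L9$306A1\n"

reader = csv.DictReader(io.StringIO(sample_csv))
ids = [row["SCRAMBLE_ID"] for row in reader]

# repr() shows the exact characters stored, independent of notebook rendering.
for value in ids:
    print(repr(value))
```

If the values print intact, the data was read correctly and the vanishing `$` characters are a display issue. In a Jupyter notebook this may be MathJax treating the text between `$` signs as math, which would also explain the italic serif "L9".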

How to find non compliant rows in pyspark

I am trying to find and segregate rows in which certain columns don't follow a certain pattern. I found the following example in the Databricks documentation, which checks whether column values are integers and writes the bad records to a JSON file.
I want to check whether the values in one column look like 1,245.00; bad records look like 1.245,00.
The number of digits can vary. I just want to check in PySpark whether the data follows a pattern like 1,245.00.
Sometimes the commas and dots are interchanged in the raw data.
Can someone tell me how to collect such records in badRecordsPath, as in the following example?
// Creates a json file containing both parsable and corrupted records
Seq("""{"a": 1, "b": 2}""", """{bad-record""").toDF().write.text("/tmp/input/jsonFile")
val df = spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .schema("a int, b int")
  .json("/tmp/input/jsonFile")
df.show()
The above example is in Scala; I am looking for a PySpark solution if possible. Thanks.
Some example values are below (each ends with two decimal digits):
1,245.00
3,5000.80
6.700,00
5.7364732.20
4,500,600.00
The dataframe of compliant data (a dot followed by two decimal digits) should contain:
1,245.00
3,5000.80
4,500,600.00
Non-compliant data points (a comma before the decimal digits) should be written to badRecordsPath:
6.700,00
5.7364732,20
Thanks
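The compliance rule described above can be expressed as a regular expression. The sketch below is plain Python and does not use the badRecordsPath mechanism; the pattern and the name `is_compliant` are assumptions inferred from the examples (digit groups of any length separated by commas, then a dot and exactly two digits).

```python
import re

# Assumed rule: digit groups separated by commas, then a dot and two digits.
COMPLIANT = re.compile(r"\d+(?:,\d+)*\.\d{2}")


def is_compliant(value):
    """True when the whole string matches the assumed 1,245.00-style pattern."""
    return COMPLIANT.fullmatch(value.strip()) is not None


values = ["1,245.00", "3,5000.80", "6.700,00", "5.7364732,20", "4,500,600.00"]
good = [v for v in values if is_compliant(v)]
bad = [v for v in values if not is_compliant(v)]
print("compliant:", good)
print("non-compliant:", bad)
```

In PySpark the same pattern could probably be applied with `Column.rlike` (anchored with `^` and `$`) to split the rows into a compliant dataframe and a non-compliant one that is then written out separately.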

Python Pandas: Split Column & Unpivot with Varying Results

I am running NLTK on a dataset and am looking to clean the end result by creating a single column containing the single words. A desired before/after view is shown below.
I figure that this would be a combination of splitting Tokenized and melting the dataframe. What is confusing me is how to handle differing counts of words for each comment. Any thoughts on what would solve this?
You can do this by:
df1 = df.explode("Tokenized")
df1 = df1.rename(columns={"Tokenized": "Single Word"})
# keep the original token list next to each word (joins on the shared index)
df1 = df1.join(df["Tokenized"])

Spark - Have I read from csv correctly?

I read a csv file into Spark using:
df = spark.read.format(file_type).options(header='true', quote='\"',
    ignoreLeadingWhiteSpace='true', inferSchema='true').load(file_location)
When I tried it with sample csv data from another source and did display(df), it showed a neatly displayed header row followed by data.
When I try it on my main data, which has 40 columns, and millions of rows, it simply displays the first 20 column headers and no data rows.
Is this normal behavior or is it reading it wrong?
Update:
I shall mark the question as answered as the tips below are useful. However my results from doing:
df.show(5, truncate=False)
currently shows:
|��"periodID","DAXDate","Country Name","Year","TransactionDate","QTR","Customer Number","Customer Name","Customer City","Document Type Code","Order Number","Product Code","Product Description","Selling UOM","Sub Franchise Code","Sub Franchise Description","Product Major Code","Product Major Description","Product Minor Code","Product Minor Description","Invoice Number","Invoice DateTime","Class Of Trade ID","Class Of Trade","Region","AmountCurrencyType","Extended Cost","Gross Trade Sales","Net Trade Sales","Total(Ext Std Cost)","AdjustmentType","ExcludeComment","CurrencyCode","fxRate","Quantity","FileName","RecordCount","Product Category","Direct","ProfitCenter","ProfitCenterRegion","ProfitCenterCountry"|
I shall have to go back to basics and preview the csv in a text editor to find out what the correct format is for this file and figure out what's going wrong. Note that I had to update my code to the following to deal with the pipe delimiter:
df = spark.read.format(file_type).options(header='true', quote='\"', delimiter='|',ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
Yes, this is normal behaviour. The dataframe function show() displays 20 rows by default. You can set a different value (but keep in mind that it doesn't make sense to print all rows of your file) and also stop it from truncating. For example:
df.show(100, truncate=False)
This is normal behaviour. You can view the content of your data in different ways:
show(): Shows the first 20 rows in a formatted table. You can pass the number of rows to display as an argument (a value larger than the number of rows in your data is fine). Cell values longer than 20 characters are truncated by default; specify truncate=False to show them in full (as cronoik correctly said in his answer).
head(): Similar to show(), but it returns the data in "Row" format rather than as a nicely formatted table. It is useful for a quick, complete look at your data, for example head(1) to see only the first row.
describe().show(): Shows a summary that gives you insight into the data, for example the count of elements and the min/max/avg value of each column.
It is normal for Spark dataframes to display a limited number of rows and columns. Your reading of the data should not be a problem. However, to confirm that you have read the csv correctly, you can check the number of rows and columns in the df. For the columns, use
len(df.columns)
or
df.columns
For the number of rows, use
df.count()
In case you need to see the content in detail you can use the option stated by cronoik.
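The two unreadable characters in front of "periodID" in the update may be a byte-order mark, which would mean the file is not plain UTF-8. That is an assumption, not something the output proves. A minimal sketch for checking the first bytes of a file with Python's standard `codecs` constants follows; the function name `sniff_bom` is hypothetical, and the demo writes its own small temporary file instead of reading the real data.

```python
import codecs
import os
import tempfile

# Longer BOMs first: the UTF-32 LE mark starts with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(path):
    """Return a codec name suggested by the file's BOM, or None if there is none."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


# Demo on a throwaway file written as UTF-16 (Python adds the BOM itself).
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write('"periodID"|"DAXDate"\n'.encode("utf-16"))
detected = sniff_bom(tmp.name)
os.remove(tmp.name)
print(detected)
```

If a codec is reported for the real file, passing it through the CSV reader's `encoding` option is a reasonable next thing to try.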

Transforming one row into many rows using Spark

I'm trying to use Spark to turn one row into many rows. My goal is something like a SQL UNPIVOT.
I have a pipe delimited text file that is 360GB, compressed (gzip). It has over 1,620 columns. Here's the basic layout:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1
There are over 800 of these property name/value fields. There are roughly 280 million rows. The file is in an S3 bucket.
The users want me to unpivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
This is my first time using Spark. I'm struggling to figure out a good way to do this.
What is a good way to do this in Spark?
Thanks.
The idea is to generate a list of lines from each input line as you have shown. This will give an RDD of lists of lines. Then use flatMap to get an RDD of individual lines:
If your file is loaded in as rdd1, then the following should give you what you want:
rdd1.flatMap(break_out)
where the function for processing lines is defined as:
def break_out(line):
    # split line into individual fields/values
    line_split = line.split("|")
    # get the values for the line
    vals = line_split[::2]
    # field names for the line
    keys = line_split[1::2]
    # first value is primary key
    primary_key = vals[0]
    # get list of field values, pipe delimited
    return ["|".join((primary_key, keys[i], vals[i + 1])) for i in range(len(keys))]
You may need some additional code to deal with header lines etc, but this should work.
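The header handling mentioned above can be sketched without a cluster. The version below runs a `break_out` equivalent over a plain Python list and skips the first line as a header; the `unpivot` wrapper is hypothetical, and the assumption that exactly one header line comes first is mine.

```python
def break_out(line):
    """Turn 'pk|name1|value1|name2|value2...' into 'pk|name|value' lines."""
    fields = line.split("|")
    primary_key = fields[0]
    names = fields[1::2]
    values = fields[2::2]
    return ["|".join((primary_key, name, value)) for name, value in zip(names, values)]


def unpivot(lines):
    """Skip the (assumed) single header line and flatten the rest."""
    body = lines[1:]
    return [out for line in body for out in break_out(line)]


lines = [
    "primary_key|property1_name|property1_value|property800_name|property800_value",
    "12345|is_male|1|is_college_educated|1",
]
rows = unpivot(lines)
for row in rows:
    print(row)
```

With an RDD, the equivalent would be to filter out the header line before calling flatMap.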
