How to find non compliant rows in pyspark

How to find non compliant rows in pyspark - python-3.x

I am trying to find and segregate some rows whose certain columns don't follow certain pattern. I found the following example from databricks document to identify and check column values are integer or not and write the bad records into json file.
I want to identify whether one column values are like 1,245.00 and bad records will be like 1.245,00.
The values can vary the number of digits and just want to check whether data follows patter like 1,245.00 in pyspark.
Sometimes in the raw data, commas and dots are inter changed.
Can someone tell me how to collect such records in badrecordpath as in the following example?
// Creates a json file containing both parsable and corrupted records
Seq("""{"a": 1, "b": 2}""", """{bad-record""").toDF().write.text("/tmp/input/jsonFile")
val df = spark.read
.option("badRecordsPath", "/tmp/badRecordsPath")
.schema("a int, b int")
.json("/tmp/input/jsonFile")
df.show()
The above example is in scala and am looking for pyspark solution if possible. Thanks.

Please find some examples as below (with 2 last decimal points)
1,245.00
3,5000.80
6.700,00
5.7364732.20
4,500,600.00
dataframe with following data (with compliance) should have dot and two digits decimals
1,245.00
3,5000.80
4,500,600.00
Illegal data points should be kept in badRecordsPath (a comma before the decimal point)
6.700,00
5.7364732,20
Thanks

Related

How to drop columns from a pandas DataFrame that have elements containing a string?

This is not about dropping columns whose name contains a string.
I have a dataframe with 1600 columns. Several hundred are garbage. Most of the garbage columns contain a phrase such as invalid value encountered in double_scalars (XYZ) where `XYZ' is a filler name for the column name.
I would like to delete all columns that contain, in any of their elements, the string invalid
Purging columns with strings in general would work too. What I want is to clean it up so I can fit a machine learning model to it, so removing any/all columns that are not boolean or real would work.
This must be a duplicate question, but I can only find answers to how to remove a column with a specific column name.

You can use df.select_dtypes(include=[float,bool]) or df.select_dtypes(exclude=['object'])
Link to docs https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html

Use apply to make a mask checking if each column contains invalid, and then pass that mask to the second position of .loc:
df = df.loc[:, ~df.apply(lambda col: col.astype(str).str.contains('invalid')).any()]

find incorrect word from dataframes in spark scala

I am learning scala/spark. As part of practice, I am trying to search incorrect words from dataframe. Below is input dataframe which has no header and it is pipe('|') delimited.
1|I am hair for spark practss
2|I thank I am doing vary badd
3|But stay tuned whoo knooow
4|That's foor now
I have other dataframe where list of correct and incorrect word specified(first column is incorrect word and other is correct word). Below is sample dataframe which has no header and pipe delimited.May I know how can I find list of incorrect word per line from input dataframe.
hair|here
thank|think
whoo|who
knooow|know
foor|for

Pandas read_csv - dealing with columns that have ID numbers with consecutive '$' and '#' symbols, along with letters and digits

I'm trying to read a csv file with a column of data that has a scrambled ID number that includes the occasional consecutive $$ along with #, numbers, and letters.
SCRAMBLE_ID
AL9LLL677
AL9$AM657
$L9$$4440
#L9$306A1
etc.
I tried the following:
df = pd.read_csv('MASTER~1.CSV',
dtype = {'SCRAMBLE_ID': str})
which rendered the third entry as L9$4440 (L9 appear in serif font, italicized, and the first and second $ vanish).
Faced with an entire column of ID numbers configured in this manner, what is the best way of dealing with such data? I can imagine:
PRIOR TO pd.read_csv: replacing the offending symbols with substitutes that don't create this problem (and what would those be), OR,
is there a way of preserving the IDs as is but making them into a data type that ignores these symbols while keeping them present?
Thank you. I've attached a screenshot of the .csv side by side with resulting df (Jupyter notebook) below.
csv column to pandas df with $$

I cannot replicate this using the same values as you in a mock CSV file.
Are you sure that the formatting based on the $ symbol is not occurring in wherever you are rendering your dataframe values? Have you checked to see if the data in the dataframe is what you expect or are you just rendering it externally?

Spark - Have I read from csv correctly?

I read a csv file into Spark using:
df = spark.read.format(file_type).options(header='true', quote='\"',
ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
When I tried it with sample csv data from another source and did diplsay(df) it showed a neatly displayed header row followed by data.
When I try it on my main data, which has 40 columns, and millions of rows, it simply displays the first 20 column headers and no data rows.
Is this normal behavior or is it reading it wrong?
Update:
I shall mark the question as answered as the tips below are useful. However my results from doing:
df.show(5, truncate=False)
currently shows:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |��"periodID","DAXDate","Country Name","Year","TransactionDate","QTR","Customer Number","Customer Name","Customer City","Document Type Code","Order Number","Product Code","Product Description","Selling UOM","Sub Franchise Code","Sub Franchise Description","Product Major Code","Product Major Description","Product Minor Code","Product Minor Description","Invoice Number","Invoice DateTime","Class Of Trade ID","Class Of Trade","Region","AmountCurrencyType","Extended Cost","Gross Trade Sales","Net Trade Sales","Total(Ext Std Cost)","AdjustmentType","ExcludeComment","CurrencyCode","fxRate","Quantity","FileName","RecordCount","Product Category","Direct","ProfitCenter","ProfitCenterRegion","ProfitCenterCountry"| +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
I shall have to go back to basics an preview the csv in a text editor to find out what the correct format is for this file to figure out what's going wrong. Note, I had to update my code to the following to deal with pipe delimter:
df = spark.read.format(file_type).options(header='true', quote='\"', delimiter='|',ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)

Yes this is normal beheaviour. The dataframe function show() has a default value to display 20 rows. You can set a different value for that (but keep in mind that it doesn't make sense to print all rows of your file) and also stop it from truncating. For example:
df.show(100, truncate=False)

It is a normal behaviour. You can view the content of your data in different ways:
show(): Show you in a formatted way the first 20 rows. You can specify as argument the number of rows you want to display (if you provide a value much higher that your data is ok!). Columns will be truncated too, as a default configuration. You can specify truncate=False to show all the columns. (like #cronoik correctly said in his answer).
head(): The same as show(), but it prints the date in a "row" format. Does not provide a nice formatted table, it is useful for a quick complete look of your data, for example with head(1) to show only the first row.
describe().show(): you can show a summary that gives you an insight of the data. For example, shows you the count of elements, the min/max/avg value of each column.

It is normal for Spark dataframes to display limited rows and columns. Your reading of the data should not be a problem. However, to confirm that you have read the csv correctly you can try to see the number of rows and columns in the df, using
len(df.columns)
or
df.columns
For number of rows
df.count()
In case you need to see the content in detail you can use the option stated by cronoik.

apache metamodel -where on non-string columns in files

I need your help,
I'm using org.apache.metamodel in order to read DataSet of files (excel, csv..),
while doing so I've noticed that all the returned columns are classified as "String", my problem is that I'm trying to filter columns values using org.apache.metamodel.query.FilterItem, but I see that the compare is a "String Comparison"- for example where (i<2) will return 1,10,111 etc..
I came across Data type conversion inside: "http://wiki.apache.org/metamodel/examples/DataTypeConversion"
but when i use it, it doesn't really changes the column type and my results are not correct.
so how can i filter on non-string columns in files,
Thanks in advance!

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to find non compliant rows in pyspark - python-3.x

Related

How to drop columns from a pandas DataFrame that have elements containing a string?

find incorrect word from dataframes in spark scala

Pandas read_csv - dealing with columns that have ID numbers with consecutive '$' and '#' symbols, along with letters and digits

Spark - Have I read from csv correctly?

apache metamodel -where on non-string columns in files

Categories

Resources