Apache MetaModel - WHERE on non-string columns in files - Excel

I need your help.
I'm using org.apache.metamodel to read DataSets from files (Excel, CSV, ...). While doing so, I've noticed that all the returned columns are classified as "String". My problem is that I'm trying to filter column values using org.apache.metamodel.query.FilterItem, but the comparison is a string comparison - for example, WHERE (i < 2) returns 1, 10, 111, etc.
I came across the data type conversion example at "http://wiki.apache.org/metamodel/examples/DataTypeConversion",
but when I use it, it doesn't really change the column type and my results are not correct.
So how can I filter on non-string columns in files?
Thanks in advance!

Related

Change data in Pandas dataframe by column

I have some data I imported from an Excel spreadsheet as a CSV. I created a dataframe using Pandas and want to change a specific column. The column contains strings such as "5.15.1.0.0". I want to change these strings to floats like 5.15100.
So far I've tried using the "replace" method to change every instance in that column:
df['Fix versions'].replace("5.15.1.0.0", 5.15100)
This, however, does not work. When I print the dataframe again after the replace calls, it shows me the same dataframe with no changes made. Is it not possible to change a string to a float using replace? If not, does anyone know another way to do this?
I could parse each string and remove the ".", but I'd prefer not to do it this way, as some of the strings represent numbers of different lengths and decimal place values.
Add the "inplace" parameter, which defaults to False. Setting it to True changes the dataframe in place; the column can then be type cast:
df['Fix versions'].replace(to_replace="5.15.1.0.0", value="5.15100", inplace=True)
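If the end goal is a numeric column rather than replacing one known string at a time, another option is to apply a small parser to the whole column. This is only a sketch under the assumption that the rule is "keep the first dot and drop the rest"; the sample data and the helper name version_to_float are made up for illustration:

import pandas as pd

df = pd.DataFrame({"Fix versions": ["5.15.1.0.0", "5.16.2.0.0"]})  # hypothetical sample

def version_to_float(s):
    # "5.15.1.0.0" -> "5.15100" -> 5.151 (a float does not keep trailing zeros)
    head, _, tail = s.partition(".")
    return float(head + "." + tail.replace(".", ""))

df["Fix versions"] = df["Fix versions"].apply(version_to_float)
print(df)

Note that floats drop trailing zeros, so "5.15100" displays as 5.151; if the exact text matters, keep the values as strings, as in the replace call above.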

How to find non-compliant rows in PySpark

I am trying to find and segregate rows in which certain columns don't follow a certain pattern. I found the following example in the Databricks documentation, which checks whether column values are integers or not and writes the bad records to a JSON file.
I want to identify whether a column's values look like 1,245.00; bad records look like 1.245,00.
The values can vary in the number of digits, and I just want to check whether the data follows a pattern like 1,245.00 in PySpark.
Sometimes in the raw data, commas and dots are interchanged.
Can someone tell me how to collect such records in badRecordsPath, as in the following example?
// Creates a json file containing both parsable and corrupted records
Seq("""{"a": 1, "b": 2}""", """{bad-record""").toDF().write.text("/tmp/input/jsonFile")
val df = spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .schema("a int, b int")
  .json("/tmp/input/jsonFile")
df.show()
The above example is in Scala and I am looking for a PySpark solution if possible (a rough sketch follows after the examples below). Thanks.
Please find some examples below (the last two digits are decimals):
1,245.00
3,5000.80
6.700,00
5.7364732.20
4,500,600.00
The dataframe with the following (compliant) data should have a dot and two decimal digits:
1,245.00
3,5000.80
4,500,600.00
Illegal data points (a comma before the decimal digits) should be kept in badRecordsPath:
6.700,00
5.7364732,20
Thanks
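As far as I know, badRecordsPath is a Databricks-specific option that only captures records Spark itself fails to parse against the schema, so a value-level pattern check like this usually needs an explicit filter on top of it. Below is a hedged PySpark sketch; the input path, the column name amount, and the exact regex are assumptions for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the raw file; records Spark cannot parse at all still land in badRecordsPath.
df = (spark.read
      .option("badRecordsPath", "/tmp/badRecordsPath")
      .option("header", "true")
      .csv("/tmp/input/amounts.csv"))  # hypothetical input file

# Compliant: digits (optionally grouped with commas) ending in a dot
# followed by exactly two decimal digits, e.g. 1,245.00 or 4,500,600.00.
pattern = r"^[0-9][0-9,]*\.[0-9]{2}$"

good = df.filter(F.col("amount").rlike(pattern))   # e.g. 1,245.00
bad = df.filter(~F.col("amount").rlike(pattern))   # e.g. 6.700,00

good.show()

# Collect the non-compliant rows next to the parse failures (or anywhere else):
bad.write.mode("overwrite").json("/tmp/badRecordsPath/pattern_violations")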

Transform data types in parts of a column

I am retrieving data through Power Query from an Oracle DB live into an Excel workbook. In PQ, under the "Transform" tab, there is a function to change the data type of a column, which I use to get all the decimal numbers displayed. In the M code the function is called TransformColumnTypes. However, I have some strings in the data that cannot be changed to decimal numbers and produce an error. Is there a way to exclude these? At the moment the function takes the whole column.
(Screenshots in the original post: the column before applying the function, the function producing the error, and the M code.)
I don't think so. If you have multiple types within a column, text is the only one that doesn't produce errors.
But if it is only the first row, like in your image, promoting it to a header before setting the column type will fix the issue.

SSRS - Lookup field exported shown as #Error

I was wondering if anybody else has come across the same problem. It's been bugging me for weeks now.
I have a big table in my report pulling data from DataSet A; one column contains data from DataSet B and is pulled using the following:
=Lookup(Fields!Date.Value, Fields!Date.Value, Fields!Sales.Value, "Sales")
Running the report shows no errors and all numbers are displayed. However, when I export the report to any format (I tried CSV, Excel, PDF), the value from the lookup column comes out as #Error (CSV, PDF) or #VALUE! (Excel).
Looking for help on the internet proved unsuccessful. Has anybody stumbled upon the same problem?
If you can look up the data field from the other dataset properly, the issue should not be with the Lookup() function. I've also tested in my local environment, and the data can be rendered after exporting. In this scenario, I suggest paying attention to the data type and data length. Can you show the sample data in your BIDS/SSDT?
It turns out it was the fact that I was returning more than one row from the other dataset (not always, but in some cases). Normally I would create a calculated field (different currencies and sales values) and sum them in the report. SSRS had no problem showing the first value by default, but the export functionality failed, since I didn't SUM() the rows that were found. I then changed the dataset to calculate the values after currency conversion and wrapped it in another SELECT statement to make sure I only have one row for each Date.Value.

How to get the max no. of columns filled in an XLSX file using POI?

I know we can get the max number of columns by iterating over all the rows and calling getLastCellNum on each row object, but this approach requires iterating over all the rows, which I want to avoid since it will take a lot of time for files with a million rows (that's the kind of file I expect to read).
When POI reads an Excel (XLS) file, it stores the sheet dimensions (first row number, last row number, first column number, last column number) in an object of the DimensionsRecord class. So if I get this object, I will get what I need. These objects can be obtained from the Sheet class, which is an internal class of POI. I was able to extract what I need for XLS files, but I have hit a roadblock for XLSX files.
Does POI maintain a DimensionsRecord object for XLSX as well? If yes, has anybody tried to extract it? Or is there some other way this can be done? Please help!
Also, I wanted to ask whether my approach is correct, i.e. using the internal classes of POI (it is getting my work done) - is this correct, or should I rely solely on the exposed APIs (too time consuming)?
There's a dimension object on XSSF Sheets too. Try:
CTSheetDimension dimension = sheet.getCTWorksheet().getDimension();
String sheetDimensions = dimension.getRef();
The one issue that springs to mind is that I'm not sure whether the dimension (CTSheetDimension or DimensionsRecord) is always required to be correct...
