Reorder rows in a Kiba job - kiba-etl

I have a Kiba job that takes a CSV file (with Kiba::Common::Sources::CSV), enriches its data, merges some rows (with the ChainableAggregateDestination destination described here) and saves the result to another CSV file (with Kiba::Common::Destinations::CSV).
Now, I want to sort the rows differently (based on the first column) in my destination CSV. I can't find a way to write a transform that does this. I could use post_process to reopen the destination CSV, sort it and rewrite it but I guess there is a cleaner way...
Can someone point me in the right direction?

To sort rows, a good strategy is to use an "aggregating transform", as explained in this article, to store all the rows in memory (although you could do it out of memory), then at transform "close" time, sort them and re-emit them in the pipeline.
This is the most flexible design IMO.
class SortingTransform
  def initialize(config...)
    @rows = []
  end
  def process(row)
    @rows << row
    nil # do not emit rows right away
  end
  def close
    # Here: sort the rows, optionally using external
    # configuration passed at init time
    @rows.sort_by { ... }.each do |row|
      yield row
    end
  end
end
You could also indeed re-open the output and sort it, in a secondary ETL job, but the first solution usually has my preference if it can work for you.

Related

Separating values that are combined in one string

I would like to solve this either in Excel or in SPSS:
I have categorical data (each number representing a medical diagnosis) that are combined into single cells. In other words, a row (patient) has multiple diagnoses. However, I would like to know the frequencies of each diagnosis. What is the best way to go about this? (See picture for reference)
For SPSS:
First just creating some sample data to demonstrate on:
data list free/e_cerv_dis_state (a20).
begin data
"{1/2/3/6}" "{1/2/4}" "{2/4/5}" "{1/5/6}" "{4}" "{4/5/6}" "{1/2/3/4/5/6}"
end data.
Now the following code will create a separate variable for each possible diagnosis, and will put a 1 in it if the diagnosis exists in the original variable.
do repeat vr=diag1 to diag9/vl=1 to 9.
compute vr=char.index(e_cerv_dis_state, string(vl, f1) ) > 0.
end repeat.
freq diag1 to diag6.
Note this will only work for up to 9 diagnoses. If you have more than that the solution will have to be adapted to multiple digits.
Assuming that the number of columns is fairly regular, I would suggest using Text to Columns and then using COUNTIF on the resulting cells to count each value you want. However, there is a more robust and reproducible solution that involves SQL. You can download the free version of SQL Server Express here: https://www.microsoft.com/en-gb/sql-server/sql-server-downloads
Then you can import your table of data; here's how to do that: How to import an Excel file into SQL Server?
Then you could use the more friendly SQL database to get the answers you want. For example, you can use a SELECT statement like this one (the table name is a placeholder for whatever you call the imported table, and the comparison only matches cells that hold a single diagnosis):
SELECT count(e_cerv_dis_state)
FROM imported_diagnoses
WHERE e_cerv_dis_state = '6'
It would also be possible to use a CASE WHEN expression to add in the names of the diagnoses.
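Both answers come down to the same idea: split each combined cell into its individual codes, then count how many patients have each code. As a rough illustration outside Excel and SPSS, here is a minimal Python sketch of that idea. The function name diagnosis_frequencies is made up for this example, and it assumes the cells look like the sample data above (codes separated by slashes inside braces). Because it extracts whole numbers, it also copes with codes of more than one digit.

```python
import re
from collections import Counter


def diagnosis_frequencies(cells):
    """Count how many cells (patients) contain each diagnosis code."""
    counts = Counter()
    for cell in cells:
        # Pull out every run of digits, e.g. "{1/2/10}" -> {"1", "2", "10"}.
        # A set is used so a code repeated within one cell counts once.
        codes = set(re.findall(r"\d+", cell))
        counts.update(codes)
    return counts


# Same sample values as in the SPSS example above.
sample = ["{1/2/3/6}", "{1/2/4}", "{2/4/5}", "{1/5/6}",
          "{4}", "{4/5/6}", "{1/2/3/4/5/6}"]
frequencies = diagnosis_frequencies(sample)
for code in sorted(frequencies, key=int):
    print(code, frequencies[code])
```

Matching whole numbers rather than single characters is what avoids the "only up to 9 diagnoses" limitation mentioned above.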

Spark - Have I read from csv correctly?

I read a csv file into Spark using:
df = spark.read.format(file_type).options(header='true', quote='\"',
ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
When I tried it with sample csv data from another source and did display(df), it showed a neatly displayed header row followed by data.
When I try it on my main data, which has 40 columns, and millions of rows, it simply displays the first 20 column headers and no data rows.
Is this normal behavior or is it reading it wrong?
Update:
I shall mark the question as answered as the tips below are useful. However my results from doing:
df.show(5, truncate=False)
currently shows:
|��"periodID","DAXDate","Country Name","Year","TransactionDate","QTR","Customer Number","Customer Name","Customer City","Document Type Code","Order Number","Product Code","Product Description","Selling UOM","Sub Franchise Code","Sub Franchise Description","Product Major Code","Product Major Description","Product Minor Code","Product Minor Description","Invoice Number","Invoice DateTime","Class Of Trade ID","Class Of Trade","Region","AmountCurrencyType","Extended Cost","Gross Trade Sales","Net Trade Sales","Total(Ext Std Cost)","AdjustmentType","ExcludeComment","CurrencyCode","fxRate","Quantity","FileName","RecordCount","Product Category","Direct","ProfitCenter","ProfitCenterRegion","ProfitCenterCountry"|
I shall have to go back to basics and preview the csv in a text editor to find out what the correct format is for this file and figure out what's going wrong. Note, I had to update my code to the following to deal with the pipe delimiter:
df = spark.read.format(file_type).options(header='true', quote='\"', delimiter='|',ignoreLeadingWhiteSpace='true',inferSchema='true').load(file_location)
Yes, this is normal behaviour. By default, the DataFrame method show() displays only the first 20 rows. You can set a different value for that (but keep in mind that it doesn't make sense to print all rows of your file) and also stop it from truncating long values. For example:
df.show(100, truncate=False)
This is normal behaviour. You can view the content of your data in different ways:
show(): Shows the first 20 rows as a formatted table. You can pass the number of rows to display as an argument (a value much higher than the number of rows in your data is fine). Long cell values are also truncated by default; specify truncate=False to show them in full (as @cronoik correctly said in his answer).
head(): Returns the first rows as Row objects instead of printing a formatted table. It is useful for a quick, complete look at your data, for example head(1) for only the first row.
describe().show(): Shows a summary that gives you an insight into the data, for example the count, mean, min and max of each column.
It is normal for Spark dataframes to display limited rows and columns. Your reading of the data should not be a problem. However, to confirm that you have read the csv correctly you can try to see the number of rows and columns in the df, using
len(df.columns)
or
df.columns
For number of rows
df.count()
In case you need to see the content in detail you can use the option stated by cronoik.
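The stray characters in front of "periodID" in the question's update might be a byte-order mark, which would suggest the file is not plain UTF-8. That is only a guess from the pasted output. A minimal sketch, using just the Python standard library, for checking this before involving Spark at all: it reads the first bytes of the file and previews a few decoded lines. The function names guess_encoding and preview are made up for this example.

```python
import codecs

# Known byte-order marks. UTF-32 is listed first because its little-endian
# mark starts with the same two bytes as the UTF-16 one.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(path, default="utf-8"):
    """Return a codec name based on the file's byte-order mark, if any."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default  # no mark found: this is an assumption, not a detection


def preview(path, n_lines=3):
    """Return the guessed encoding and the first few decoded lines."""
    encoding = guess_encoding(path)
    lines = []
    with open(path, "r", encoding=encoding, newline="") as f:
        # zip stops after n_lines, so the rest of the file is never read.
        for _, line in zip(range(n_lines), f):
            lines.append(line.rstrip("\r\n"))
    return encoding, lines
```

If this reports something other than UTF-8, passing the matching name through the CSV reader's encoding option when loading the file may be worth trying. The preview also makes the real delimiter and quoting visible without opening millions of rows in an editor.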

Removing duplicates between multiple large CSV files

I am trying to find the best way to remove duplicates from large CSV files.
I receive CSV files of around 5/6 million rows every month.
I need to adjust these (I only need some of the columns, and I need to add some others).
The files also contain a lot of duplicate and incomplete rows.
I've come up with a solution in Python where I use a set, check for each row whether it's already in the set, and change what needs changing.
Now, I get the second file, and it contains a lot of duplicates that are in the previous file.
I'm trying to find an efficient solution to remove duplicates within the file, and between the different files. In the end I want to have a list (table or csv file) that contains only the new entries for that month.
I would like to use Python, and I was thinking about using a SQLite database for storing the data. But I'm unsure which way would be most efficient.
I would use numpy.unique():
import numpy as np
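# delimiter="," is needed for comma-separated files; loadtxt expects numeric columns only
data = np.vstack((np.loadtxt("path/to/file1.csv", delimiter=","), np.loadtxt("path/to/file2.csv", delimiter=",")))
# this will stack both arrays on top of each other, creating one giant array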
data = np.unique(data, axis=0)
np.unique takes the entire array and returns only the unique elements. Make sure you set axis=0 so that it goes row by row and not cell by cell.
One caveat: This should work, but if there are several million rows, it may take a while. Still better than doing it by hand though! Good luck!
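Since the question mentions SQLite, here is a minimal sketch of that route using only the standard library. It streams each monthly file, keeps the keys already seen in a SQLite table, and writes only rows whose key is new, so duplicates within a file and across earlier files are both dropped. The function name write_new_rows, the table name seen and the idea of identifying a row by a chosen set of key columns are all assumptions made for this example, not something from the question.

```python
import csv
import sqlite3


def write_new_rows(csv_path, out_path, conn, key_columns):
    """Copy rows with a not-yet-seen key to out_path; return how many."""
    conn.execute("CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY)")
    written = 0
    # "conn" as a context manager commits the inserts when the block ends.
    with open(csv_path, newline="") as src, \
            open(out_path, "w", newline="") as dst, conn:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Build one string key; missing fields in short rows become "".
            key = "\x1f".join((row[c] or "").strip() for c in key_columns)
            cur = conn.execute(
                "INSERT OR IGNORE INTO seen (key) VALUES (?)", (key,)
            )
            if cur.rowcount == 1:  # the insert happened, so the key is new
                writer.writerow(row)
                written += 1
    return written
```

Used with an on-disk database (for example sqlite3.connect("seen.db"), a hypothetical file name), the table persists between months, so next month's file is only compared against keys and never against the old CSV files themselves. Column selection and enrichment could happen just before writer.writerow.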

Replacing null values with zeroes in multiple columns [Spotfire]

I have about 100 columns with some empty values that I would like to replace with zeroes. I know how to do this with a single column using Calculate and Replace, but I wanted to see if there was a way to do this with multiple columns at once.
Thanks!
You could script it, but it'd probably take you as long to write the script as it would to do it manually with a transformation. A better idea would be to fix it in the data source itself before you import it, so Spotfire doesn't have to do the transformation every time, which could hinder performance if you are dealing with a large amount of data.

Managing data sets in SPSS where multiple cases appear in one row

I'm working with a data set which has details on multiple people on one row. How I've dealt with this is to have variables like this:
P1Name P1Age P1Gender P1Ethnicity P2Name P2Age P2Gender... etc
This makes analysis very difficult. I have used multiple response variables, which are good for frequencies, but it's unwieldy, takes time to write out the syntax (there's a lot of 'p's) and you can't do other analysis with it.
First of all, is there a way to run analyses as if all the name, age, gender and so on variables are on the same row? (If that makes sense.) To do this, all I can think of doing is pasting the data into Excel and then cutting and pasting to get them all into the same columns, then pasting back to SPSS. Any other ideas?
Or is this just a matter of having two datasets, one for the case details and one for the people details?
Any advice would be greatly appreciated!
Write the data out using SAVE TRANSLATE and then read it back in removing the P's, like this:
FILE HANDLE MyFile /NAME='/Users/rick/tmp/test.csv'.
DATA LIST FREE /p1x1 p1x2 p1x3 p2x1 p2x2 p2x3 p3x1 p3x2 p3x3.
BEGIN DATA.
1 2 3 1 2 3 1 2 3
END DATA.
LIST.
SAVE TRANSLATE
/OUTFILE=MyFile
/TYPE=CSV /ENCODING='UTF8' /REPLACE
/CELLS=VALUES.
DATA LIST FREE FILE=MyFile /x1 x2 x3.
LIST.
That should do it.
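The restructuring itself (one row per person instead of one row per case) can also be illustrated outside SPSS. Below is a minimal Python sketch under stated assumptions: column names follow the P1Name, P1Age, P2Name pattern from the question, and each case has an identifying column, here called CaseID, which is a made-up name. The function wide_to_long is likewise hypothetical.

```python
import re


def wide_to_long(row, id_column="CaseID"):
    """Split one wide record into one dict per person.

    Columns named like P1Name or P2Age are grouped by their person
    number; people whose fields are all empty are skipped.
    """
    people = {}
    for name, value in row.items():
        match = re.fullmatch(r"P(\d+)(\w+)", name)
        if match is None:
            continue  # not a per-person column, e.g. the case id
        person, field = int(match.group(1)), match.group(2)
        people.setdefault(person, {})[field] = value
    return [
        {id_column: row.get(id_column), "Person": person, **fields}
        for person, fields in sorted(people.items())
        if any(v not in ("", None) for v in fields.values())
    ]


case = {"CaseID": "7", "P1Name": "Ann", "P1Age": "30",
        "P2Name": "Bob", "P2Age": "41", "P3Name": "", "P3Age": ""}
for person_row in wide_to_long(case):
    print(person_row)
```

Keeping the case identifier on every person row is what lets the people-level data be linked back to a separate case-level dataset, which is the two-dataset layout the question asks about.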
