How to compare two CSV files in NiFi? - groovy

Is it possible to compare two CSV files in NiFi (1.7) using a Groovy script? One file is on an SFTP server and the other is a flow file. The problem is similar to the one in the link below.
I have tried the approach given in that link:
How can I compare the one line in one CSV with all lines in another CSV file?

Related

Why can't I merge multiple parquet files using "cat file1.parquet file2.parquet > result.parquet"?

I have created multiple parquet files using pyspark and now I'm trying to merge them all into one. I'm able to merge the files, but when I read the resulting file I get an error. Has anyone faced this issue before?
You cannot simply concatenate Parquet files using cat, as they are binary files with a "table of contents" in the footer. To merge two files, you have to read them both in and write out a completely new file. This can be done easily using the merge command in parquet-tools.
The technical reason that merging two Parquet files with cat doesn't work is that a Parquet file is useless without its footer. Every Parquet file has roughly the following structure:
RowGroup(nrows=..)
  Column with nrows
  Column with nrows
  ..
RowGroup(nrows=..)
  ..
..
Footer
  Schema (tells you the type of the columns)
  total_nrows
  Location of RowGroups in the file
If you cat two files together, a reader would only see the footer of the last file, so the row groups of the first file are not referenced anywhere. Additionally, if the library used to read the file does integrity checks, it will detect that the file is broken and fail with an error.
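The footer problem can be illustrated with a small sketch. This is not Parquet: it is a hypothetical toy format (the functions write_toy_file and read_toy_file are made up for illustration) in which a fixed-size footer at the end records the row count and payload length, which is the same general idea of a trailing "table of contents".

```python
import struct

FOOTER_SIZE = 8  # two unsigned 32-bit integers: row count, payload length


def write_toy_file(rows):
    """Serialize rows as [payload][footer]; the footer describes the payload."""
    payload = "\n".join(rows).encode("utf-8")
    footer = struct.pack("<II", len(rows), len(payload))
    return payload + footer


def read_toy_file(blob):
    """Read the footer at the end of the blob, then only the payload it describes."""
    n_rows, payload_len = struct.unpack("<II", blob[-FOOTER_SIZE:])
    end = len(blob) - FOOTER_SIZE
    payload = blob[end - payload_len:end]
    rows = payload.decode("utf-8").split("\n") if payload else []
    if len(rows) != n_rows:
        raise ValueError("footer does not match payload")
    return rows


a = write_toy_file(["a1", "a2"])
b = write_toy_file(["b1", "b2", "b3"])

# Equivalent of `cat a b > c`: only the last footer is found, so a's rows are lost.
rows_seen = read_toy_file(a + b)

# A real merge reads both inputs and writes a new file with one new footer.
rows_merged = read_toy_file(write_toy_file(read_toy_file(a) + read_toy_file(b)))
```

In this toy case the concatenated blob still parses, but silently drops the first file's rows; a real Parquet reader may instead reject the file outright.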

how to store multiple files in one file in python?

How can I store multiple files in one file using Python?
I mean my own file format, not a zip or a rar.
For example, I want to create an archive from a folder but with my own file format (like 'Files.HR').
Or just store files in one file without any dictionary or file format ('Files', with no file format).
You may want to use "tar" files. In Python, you can use the tarfile module to write files into the archive and later extract them back into real files.
You do not have to name the file *.tar. You can name it something related to your specific application, such as Files.HR.
See a tarfile tutorial or the official docs for details on how to use tarfile.
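As a rough sketch of that suggestion, the following packs a folder into a single archive named Files.HR and unpacks it again. The helper names pack_folder and unpack_archive and the sample file names are made up for illustration; only the tarfile, tempfile, and pathlib calls come from the standard library.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_folder(folder, archive_path):
    """Write every file under `folder` into one tar archive at `archive_path`."""
    folder = Path(folder)
    with tarfile.open(archive_path, "w") as archive:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Store paths relative to the folder so they extract cleanly.
                archive.add(path, arcname=path.relative_to(folder).as_posix())


def unpack_archive(archive_path, target_dir):
    """Extract all members of the archive back into real files."""
    with tarfile.open(archive_path, "r") as archive:
        archive.extractall(target_dir)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    source.mkdir()
    (source / "a.txt").write_text("alpha")
    (source / "b.txt").write_text("beta")

    archive_file = Path(tmp) / "Files.HR"  # the extension is arbitrary
    pack_folder(source, archive_file)

    restored = Path(tmp) / "restored"
    restored.mkdir()
    unpack_archive(archive_file, restored)

    restored_names = sorted(p.name for p in restored.iterdir())
    restored_text = (restored / "a.txt").read_text()
```

The archive is still an ordinary tar file under a different name, so standard tar tools can open it as well.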

How to read specific files from a directory based on a file name in spark?

I have a directory of CSV files. The files are named based on date similar to the image below:
I have many CSV files that go back to 2012.
So, I would like to read only the CSV files that correspond to a certain date. How is that possible in Spark? In other words, I don't want my Spark engine to read all the CSV files, because my data is huge (TBs).
Any help is much appreciated!
You can specify a list of files to be processed when calling the load(paths) or csv(paths) methods from DataFrameReader.
So one option would be to list and filter the files on the driver, then load only the "recent" files:
val files: Seq[String] = ???
spark.read.option("header","true").csv(files:_*)
Edit:
You can use this Python code (not tested yet); in PySpark, csv() accepts a list of paths as its first argument:
files = ['foo', 'bar']
df = spark.read.csv(files)
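The "list and filter on the driver" step could look roughly like this. The file naming scheme is not shown in the question, so names of the form YYYY-MM-DD.csv are an assumption here, and files_for_dates is a hypothetical helper. The Spark call is left as a comment because it needs an active SparkSession.

```python
from datetime import date
from pathlib import PurePosixPath


def files_for_dates(paths, wanted_dates):
    """Keep only the paths whose file name (without extension) is a wanted ISO date."""
    wanted = {d.isoformat() for d in wanted_dates}
    return [p for p in paths if PurePosixPath(p).stem in wanted]


# Assumed naming scheme: one CSV per day, named YYYY-MM-DD.csv.
all_files = [
    "data/2012-01-30.csv",
    "data/2012-01-31.csv",
    "data/2012-02-01.csv",
]

selected = files_for_dates(all_files, [date(2012, 1, 31)])

# With an active SparkSession named `spark`, only the selected files are read:
# df = spark.read.csv(selected, header=True)
```

Because the filtering happens on plain strings before Spark is involved, the other files are never opened.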

Avoid overwriting of files with "for" loop

I have a list of dataframes (df_cleaned) created from multiple csv files chosen by the user.
My objective is to save each dataframe within the df_cleaned list as a separate csv file locally.
The following code saves each file under its original title, but every iteration overwrites the previous file, so only a copy of the last dataframe is kept.
How can I fix it? According to my very basic knowledge perhaps I could use a break-continue statement in the loop? But I do not know how to implement it correctly.
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\Data Docs\TrainData\{}.csv'.format(name))
print('Saving of files as csv is complete.')
You can create a different name for each file. As an example, the following appends the loop index to name:
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\Data Docs\TrainData\{0}_{1}.csv'.format(name,i))
print('Saving of files as csv is complete.')
This will create a list of files named <name>_N.csv with N = 0, ..., len(df_cleaned)-1.
A very easy fix. I just figured out the answer myself and am posting it to help someone else.
fileNames is a list I created at the start of the code to save the names of the files chosen by the user.
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\TrainData\{}.csv'.format(fileNames[i]))
print('Saving of files as csv is complete.')
This saves a separate copy of each file in the defined directory.
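Both answers come down to giving every iteration its own file name. A minimal sketch combining them is below; it uses plain lists of rows and the csv module as stand-ins for the dataframes so that it runs without pandas, and save_tables is a hypothetical helper. With pandas, the write step would be a to_csv call on each dataframe instead.

```python
import csv
import tempfile
from pathlib import Path


def save_tables(tables, names, out_dir):
    """Write each table to its own CSV; the index suffix keeps names unique."""
    out_dir = Path(out_dir)
    written = []
    for i, (rows, name) in enumerate(zip(tables, names)):
        target = out_dir / f"{name}_{i}.csv"
        with target.open("w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        written.append(target.name)
    return written


# Two tables that share the same base name, which would collide without the index.
tables = [
    [["a", "b"], [1, 2]],
    [["a", "b"], [3, 4]],
]
names = ["train", "train"]

with tempfile.TemporaryDirectory() as tmp:
    saved = save_tables(tables, names, tmp)
    files_on_disk = sorted(p.name for p in Path(tmp).iterdir())
```

Keeping the index in the name means the loop stays safe even when the user picks two files with the same title.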

Compare CSV file data using Nodejs

I want to compare the data in two .csv files. I have to find the updated data between these two .csv files using Node.js.
Is there any way to do it in Node.js?
Thanks, I am very new to this.
It will be easiest using one of the following modules:
https://www.npmjs.com/package/csv
https://www.npmjs.com/package/tsv
or another one that you find at:
https://www.npmjs.com/browse/keyword/csv
https://www.npmjs.com/browse/keyword/tsv
(don't worry if it's CSV or TSV - just make sure that you use the correct delimiter which is comma in your case).
