Hi, I have two files and I want to compare them using the key field and a packed decimal field.
I have searched many forums but could not find a solution.
Please provide a solution in either SyncSort or DFSORT.
Both files have an LRECL of 200; the packed decimal field starts at position 84 and has a length of 9.
Both files have the same format. I need the output as below.
The key starts at position 1, has a length of 22, and is alphanumeric in both files.
File1: should have the matched records from file1 and file2
File2: should have the unmatched records from file1 and file2
Thanks in Advance,
Lakshmi
Yes, a very simple JOINKEYS. You'll need a JOIN UNPAIRED. You'll need two OUTFIL statements for your two output files.
It is unusual to have a choice of either DFSORT or SyncSort. DFSORT has a "matching marker" for JOINKEYS, so very easy to do the final extract. SyncSort relies on you testing for a value which cannot exist in the data to show that a record from one file or the other is not present. Find out which one you have (ICE messages in the sysout from the step are from DFSORT, WER messages are from SyncSort).
I can't believe that you won't be able to find many examples of JOINKEYS if you do a bit of googling.
If you get stuck, add what you have tried to your question, being clear about which SORT product you actually have access to.
If you look at the DFSORT documentation, I believe you will find that using the JOINKEYS statement will give you your desired result.
I have huge_database.csv like this:
name,phone,email,check_result,favourite_fruit
sam,64654664,sam#example.com,,
sam2,64654664,sam2#example.com,,
sam3,64654664,sam3#example.com,,
[...]
===============================================
then I have 3 email lists:
good_emails.txt
bad_emails.txt
likes_banana.txt
the contents of which are:
good_emails.txt:
sam#example.com
sam3#example.com
bad_emails.txt:
sam2#example.com
likes_banana.txt:
sam#example.com
sam2#example.com
===============================================
I want to do some grep, so that at the end the output will be like this:
sam,64654664,sam#example.com,y,banana
sam2,64654664,sam2#example.com,n,banana
sam3,64654664,sam3#example.com,y,
I don't mind doing it manually in multiple steps, even with a clumsy procedure such as copy-pasting into multiple files. What matters to me is reliability and, most importantly, the ability to process very LARGE csv files with more than 1M lines.
It should also be noted that the lists I "grep" to fill in some of the columns will usually affect at most 20% of the csv rows, meaning the remaining 80% must stay intact and, if possible, keep their current order.
I will also be using a program called EmEditor rather than spreadsheet software like Excel, because it is faster and Excel simply cannot process large csv files.
How can this be done?
Will appreciate any help.
Thanks.
Googling, trial and error, grabbing my head from frustration.
Filter all good emails with Filter. Open Advanced Filter. Next to the Add button is Add Linked File. Add the good_emails.txt file, set to the email column, and click Filter. Now only records with good emails are shown.
Select column 4 and type y. Now do the same with bad_emails.txt and type n in column 4. Follow the same steps with likes_banana.txt and set the last column to banana.
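If a script is acceptable alongside EmEditor, the same three lookups can be sketched in Python. This is only a minimal sketch, not the EmEditor method: the function names `load_emails` and `enrich` are made up for illustration, and it assumes the email lists fit in memory as sets while the big csv is streamed row by row, so untouched rows keep their content and order.

```python
import csv
import io


def load_emails(lines):
    """Return the set of non-empty, stripped emails from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def enrich(rows_in, rows_out, good, bad, banana):
    """Stream the csv, filling check_result and favourite_fruit where an email matches.

    Rows whose email is in none of the sets are written back unchanged,
    and the original row order is preserved.
    """
    reader = csv.DictReader(rows_in)
    writer = csv.DictWriter(rows_out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        email = row["email"]
        if email in good:
            row["check_result"] = "y"
        elif email in bad:
            row["check_result"] = "n"
        if email in banana:
            row["favourite_fruit"] = "banana"
        writer.writerow(row)


# Sample data copied from the question; real use would pass open file objects.
database = io.StringIO(
    "name,phone,email,check_result,favourite_fruit\n"
    "sam,64654664,sam#example.com,,\n"
    "sam2,64654664,sam2#example.com,,\n"
    "sam3,64654664,sam3#example.com,,\n"
)
result = io.StringIO()
enrich(
    database,
    result,
    good=load_emails(["sam#example.com\n", "sam3#example.com\n"]),
    bad=load_emails(["sam2#example.com\n"]),
    banana=load_emails(["sam#example.com\n", "sam2#example.com\n"]),
)
print(result.getvalue())
```

Because only the lists are held in memory, the size of the main csv should not matter much; with real files you would open them with `newline=""` as the csv module recommends.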
I'm quite new to GNU Octave, so can anyone help me with two things:
(1) How can I filter that huge dataset in such a way that it will only contain [1x1 struct] persons?
(2) Inside that value of struct, I only want to retain combined_categories. How can I delete the others?
Basically, my end goal is to have a dataset with 2 columns only (filename and combined_categories of the filtered 1x1 structs). And if I can convert that to csv, that would be more awesome.
Regarding your first question, how to filter a struct: the first step is to create a logical vector that decides which elements to keep and which to delete:
% Get the data for the relevant field
persons = {test.person};
% For each element, check whether it holds exactly one person
one_person = cellfun(@numel, persons) == 1;
% Keep only the elements you want
test = test(one_person);
About your second question, please check the documentation for rmfield
I have two CSV files: one of 25 000 lines containing all the data, and one of 9000 lines containing the names whose data I need to pull from the first one.
Someone told me that would be fairly easy using Excel but I can't seem to find a similar problem.
I've tried comparison tools, but they are not helping me isolate what I need.
Using this example
Master file :
Name;email;displayname
Bbob;Bbob#mail.com;Bob bob
Mmartha;Martha#mail.com;Mmartha
Cclaire;Cclaire#mail.com;cclair
Name file :
Name
Mmartha
Cclaire
What i need to get after comparison :
Name;email;displayname
Mmartha;Martha#mail.com;Mmartha
Cclaire;Cclaire#mail.com;cclair
So for the names I have in my second csv, I need to get the entire line from the master csv file.
Right now I can use Notepad compare, for example, but on 25000 lines that's a lot of manual labor. I think someone must have faced a similar issue and found a way.
I can't seem to find a solution right now so here I am.
Apologies in advance for the Dutch screenshots. I'm unsure about the English terms in PowerQuery, but you should be able to follow the procedure.
Using PowerQuery:
Start PowerQuery
Load both source CSV1 and CSV2
Join Query as new
Select column 1 in both tables and choose the Inner join option
Result should look like this:
Use first row as headers:
Delete 4th column, close and load values
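For anyone who prefers a script to PowerQuery, the same inner join can be sketched in Python. This is just an illustrative sketch: the function name `filter_by_names` is hypothetical, and it assumes both files are semicolon-separated with a `Name` header, as in the example from the question.

```python
import csv
import io


def filter_by_names(master, names, out, key="Name", delimiter=";"):
    """Write the header plus every master row whose key appears in the names file."""
    # Collect the wanted names once; set membership tests are fast.
    wanted = {row[key] for row in csv.DictReader(names, delimiter=delimiter)}
    reader = csv.DictReader(master, delimiter=delimiter)
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        if row[key] in wanted:
            writer.writerow(row)


# Sample data copied from the question; real use would pass open file objects.
master = io.StringIO(
    "Name;email;displayname\n"
    "Bbob;Bbob#mail.com;Bob bob\n"
    "Mmartha;Martha#mail.com;Mmartha\n"
    "Cclaire;Cclaire#mail.com;cclair\n"
)
names = io.StringIO("Name\nMmartha\nCclaire\n")
result = io.StringIO()
filter_by_names(master, names, result)
print(result.getvalue())
```

The rows come out in master-file order, and the comparison is exact and case-sensitive, so names that differ in case or have stray spaces would need normalising first.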
I am working with two datasets in csv form (movielens latest-small dataset). Given below are the fields of both.
rating.csv
user_id movie_id rating
movie.csv
movie_id movie_name
what I want is to combine them into a single .csv with following fields
user_id movie_id movie_name rating
So that the common column movie_id maps with corresponding movie_name.
Could that be done using Excel? If not, how can I do it?
I just need it as a dataset for my recommender engine, so any simple solution is welcome, as the end result is all that matters. I have some experience in Java, so a Java solution would be easy for me to understand and implement.
If there is some way using Excel then that would be the best. I have tried searching online and found a VLOOKUP method but couldn't clearly follow it.
I also tried some online merging tools, but they just appended the sheets one after another without mapping the column. So I have no problem using online tools either.
This is the method with a VLOOKUP formula within Excel:
The formula takes 4 arguments:
(1) The value you want to look up
(2) The range of data you are looking in
(3) The column within (2) that contains the answer you want
(4) Whether to match on (1) approximately (TRUE) or exactly (FALSE)
See here for documentation on the function.
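If Excel struggles with the file size, the same exact-match lookup can be sketched in Python. Treat this as a rough sketch under assumptions: the function name `join_ratings` is made up, the files are assumed to be comma-separated with the header names given in the question, and the sample rows below are invented purely for illustration.

```python
import csv
import io


def join_ratings(ratings, movies, out):
    """Add movie_name to each rating row via an exact-match lookup on movie_id."""
    # Build the lookup table once: movie_id -> movie_name.
    names = {row["movie_id"]: row["movie_name"] for row in csv.DictReader(movies)}
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["user_id", "movie_id", "movie_name", "rating"])
    for row in csv.DictReader(ratings):
        # An unknown movie_id gives an empty name instead of dropping the row.
        name = names.get(row["movie_id"], "")
        writer.writerow([row["user_id"], row["movie_id"], name, row["rating"]])


# Invented sample rows, only to show the shape of the data.
ratings = io.StringIO("user_id,movie_id,rating\n1,1,4.0\n1,2,3.5\n2,3,5.0\n")
movies = io.StringIO("movie_id,movie_name\n1,Toy Story (1995)\n2,Jumanji (1995)\n")
result = io.StringIO()
join_ratings(ratings, movies, result)
print(result.getvalue())
```

Keeping unmatched ratings with a blank name mirrors a left join; skipping them instead would give an inner join.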
Check out this tool - https://github.com/DataFoxCo/gocsv - it's based off of csvkit but has a ton of additional features. One of our engineers custom built it - and open sourced it to help solve some of these data issues we deal with every day :)
It will do a vlookup essentially of any sized csv in merely seconds using the join command:
gocsv join --columns 'movie_id','movie_id' --left rating.csv movie.csv > combineddata.csv
then if you still want to reorder the columns, you can do that too:
gocsv select --columns 'user_id','movie_id','movie_name','rating' combineddata.csv > combineddata-final.csv
I split the commands out up top to help explain its use - the documentation has all the examples on it also but ultimately I would really recommend pipelining it and doing it in one command like this:
cat rating.csv \
| gocsv join --left --columns 'movie_id','movie_id' movie.csv \
| gocsv select --columns 'user_id','movie_id','movie_name','rating' > combineddata.csv
I'm using csvfix to sort a CSV file based on an integer (counter) value in the second column. However it seems that csvfix puts double quotes around all fields in the file, turning them to strings, before it performs the sort. The result is that the rows are sorted by the string value, such that "1000" comes before "2".
There is a command-line option -smq that is supposed to apply "smart quoting" but that's not helping me. If I use the command csvfix echo -smq file.csv, the output has no quotes around numerical fields, but when I pipe that into csvfix sort -f 2 file.csv, the file is written without quotes but still sorted in "string order". It makes no difference whether I include the -smq flag in the sort command or not.
Additionally I would like csvfix to ignore the first row of string headers. Csvfix issue tracking claims this is already implemented but I can only find the -ifn flag that seems to cut the header row out entirely.
These seem like pretty basic pieces of functionality for this tool, so I'm probably missing something very simple. Hoping someone on here has used csvfix and figured this out.
According to the online documentation for csvfix, sort has an N option for numeric sorts:
csvfix sort -f 2:N file.csv
Having said this, CSV isn't a particularly good format for text manipulation. If possible, you're much better off choosing DSV (delimiter separated values) such as Tab or Pipe separated, so that you can simply pipe the output to sort, which has ample capability to sort by field, using whatever collation method you need.
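If neither csvfix nor sort handles the header row the way you want, a numeric sort that keeps the header in place can be sketched in a few lines of Python. This is not csvfix behaviour, only an illustration: the function name `sort_by_int_column` and the sample rows are made up, and it assumes the first row is a header and the chosen column always holds integers.

```python
import csv
import io


def sort_by_int_column(src, out, column=1):
    """Copy the header row, then write the remaining rows sorted numerically.

    `column` is a zero-based index, so 1 means the second column.
    """
    reader = csv.reader(src)
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader)  # keep the header out of the sort
    writer.writerow(header)
    # int() makes 2 sort before 1000, unlike a plain string comparison.
    writer.writerows(sorted(reader, key=lambda row: int(row[column])))


# Invented sample rows, only to show the string-order problem being avoided.
source = io.StringIO("name,count\na,1000\nb,2\nc,30\n")
result = io.StringIO()
sort_by_int_column(source, result)
print(result.getvalue())
```

Note that `sorted` loads all data rows into memory, which is fine for moderately sized files but would need a different approach for very large ones.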