How to merge two single column csv files with linux commands

I was wondering how to merge two single-column CSV files into one file where the resulting file contains two columns.
file1.csv
first_name
chris
ben
jerry
file2.csv
last_name
smith
white
perry
result.csv
first_name,last_name
chris,smith
ben,white
jerry,perry
Thanks

$ cat file1
John
Mary
$ cat file2
Smith
Jones
$ paste -d, file1 file2
John,Smith
Mary,Jones
The -d, argument designates the comma as the delimiter between columns.
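Applied to the files from the question (header lines included), the same command should produce the desired result; redirect it to create result.csv:
$ paste -d, file1.csv file2.csv > result.csv
$ cat result.csv
first_name,last_name
chris,smith
ben,white
jerry,perry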

You're looking for paste.

Related

How to compare the columns of file1 to the columns of file2, select matching values, and output to a new file using grep or unix commands

I have two files, file1 and file2, where target_id values make up the first column of both.
I want to compare file1 to file2 and keep only the rows of file1 whose target_id also appears in file2.
file2:
target_id
ENSMUST00000128641.2
ENSMUST00000185334.7
ENSMUST00000170213.2
ENSMUST00000232944.2
% grep -x -f file1 file2 resulted in no output in my terminal.
Any help would be appreciated.
Here is sample data that actually shows overlaps between the files:
file1.csv:
target_id,KO_1D_7dpi,KO_2D_7dpi
ENSMUST00000178537.2,0,0
ENSMUST00000178862.2,0,0
ENSMUST00000196221.2,0,0
ENSMUST00000179664.2,0,0
ENSMUST00000177564.2,0,0
file2.csv
target_id
ENSMUST00000178537.2
ENSMUST00000196221.2
ENSMUST00000177564.2
Your grep command, but with the file arguments swapped:
$ grep -F -f file2.csv file1.csv
target_id,KO_1D_7dpi,KO_2D_7dpi
ENSMUST00000178537.2,0,0
ENSMUST00000196221.2,0,0
ENSMUST00000177564.2,0,0
Edit: we can add the -F argument since this is a fixed-string search; it also protects against the . in the IDs matching any character as a regex metacharacter. Thanks to @Sundeep for the recommendation.
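If you want to rule out partial matches entirely (an ID that happens to be a substring of a longer one), an exact comparison on the first field with awk is another option; a sketch, assuming target_id stays in column 1 of both files:
$ awk -F, 'NR==FNR{ids[$1]; next} $1 in ids' file2.csv file1.csv
target_id,KO_1D_7dpi,KO_2D_7dpi
ENSMUST00000178537.2,0,0
ENSMUST00000196221.2,0,0
ENSMUST00000177564.2,0,0
The first pass (NR==FNR) loads the IDs from file2.csv into an array; the second pass prints only those rows of file1.csv whose first field is one of those IDs.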

Finding repeated names in a file

Hi, I have a txt file with the last names and first names of people, and I want to use egrep to display only the first names of people who share the same last name. I have no idea how I could do this. Thanks for the help.
my txt looks like this:
snow john
snow jack
miller george
mcconner jenny
and the output should be:
john
jack
I've currently tried running:
cat names.txt | cut -d " " -f 1 | awk 'seen[$]++'
...but this fails with an error:
awk: syntax error at source line 1
context is
>>> seen[$] <<<
awk: bailing out at source line 1
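The error comes from the bare $ inside seen[$]: awk's $ operator needs a field number, e.g. $0 for the whole record. A minimal repair of that pipeline, assuming the aim were only to surface the repeated last names rather than the first names, could be:
$ cut -d " " -f 1 names.txt | awk 'seen[$0]++'
snow
To get the first names instead, you need to remember which first name goes with which last name, as in the answers below.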
You can use a typical 2-pass approach with awk:
awk 'NR == FNR {freq[$1]++; next} freq[$1]>1{print $2}' file file
john
jack
Reference: Effective AWK Programming
awk is your friend. With a single-pass approach, you can achieve your result using a memory technique: keep a copy of the most recent first name seen for each last name.
Given an input file as follows:
$ cat file
snow john
snow jack
miller tyler
snow leopard
kunis ed
snow jack
snow miller
snow miller
sofo mubu
sofo gubu
...the following shell command uses a single awk pass to generate correct output:
$ awk 'count1[$1]==1 && ++count2[name[$1]]==1{print name[$1]} # replay the main rule for the remembered first name of this last name
count1[$1]++ && ++count2[$2]==1{print $2}                     # main logic: last name already seen and this first name not yet printed
{name[$1]=$2}                                                 # remember the current first name for this last name
' file
john
jack
leopard
miller
mubu
gubu
Note: the final answer includes the suggestion from @ordoshsen mentioned in [this] comment. For more on awk, refer to [the manual].

How to use awk to delete lines of file1 whose column 1 values exist in file2 in Ubuntu?

Say we have file1.csv like this
"agvsad",314
"gregerg",413
"dfwer",53214
"fewf",344
and file2.csv like this
"dfwer"
"fewf"
How can I use awk to delete the lines whose column 1 values exist in file2, to get a file3 that looks like:
"agvsad",314
"gregerg",413
By the way, I am dealing with millions of lines.
awk 'NR==FNR{seen[$0]++; next} !seen[$1]' file2.csv FS=, file1.csv should do what you want, but it will require enough memory to store an entry for each line in file2.csv.
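For reference, run against the sample data above, that command produces exactly the desired file3 (redirect with > file3.csv to save it):
$ awk 'NR==FNR{seen[$0]++; next} !seen[$1]' file2.csv FS=, file1.csv
"agvsad",314
"gregerg",413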
As an alternative, using grep:
$ grep -vf file2.csv file1.csv
"agvsad",314
"gregerg",413

Command line to consider common values in only in specific column

I am looking for a simple command line to help me with the following task.
I have two files and I would like to print the Col2 values that the two files have in common.
For instance, File1 is similar to the following 3-column, tab-separated example:
File1
cat big 24
cat small 13
cat red 63
File2
dog big 34
chicken plays 39
fish red 294
desired output
big
red
I have tried commands using the comm syntax: comm /path/to/file1/ /path/to/file2
However, it does not output anything, because the values in Col1 and Col3 will very rarely be in common.
Does anyone have a suggestion as to how this can be solved, maybe awk is a better solution?
If you read the man page of comm, you will see that it works on sorted files. But awk is flexible; you can control exactly what you compare:
awk 'NR==FNR{a[$2]=1;next}a[$2]{print $2}' file1 file2
You could do it in a single pass with paste and awk:
paste file1 file2 | awk '$2 == $5 { print $2 }'
Output:
big
red
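If you would rather stay with comm, it can still be made to work, but only on sorted, single-column input; a sketch using bash process substitution, and assuming the files really are tab-separated so cut's default delimiter applies:
$ comm -12 <(cut -f2 file1 | sort) <(cut -f2 file2 | sort)
big
red
Here comm -12 suppresses the lines unique to each file and prints only the Col2 values common to both.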

Identify which keywords can be found in which files

The Problem
Suppose I have a text file containing a list of words. Each word appears on a separate line. Let's take the following as an example and we'll call it my_dictionary_file.txt:
my_dictionary_file.txt
Bill
Henry
Martha
Sally
Alex
Paul
In my current directory, I have several files which contain the above names. The problem is that I do not know which files contain which names. This is what I'd like to find out: a sort of matching game. In other words, I want to match each name in my_dictionary_file.txt to the file in which the name appears.
As an example, let's say that the files in my working directory look like the following:
file1.txt
There is a man called Bill. He is tall.
file2.txt
There is a girl called Martha. She is small.
file3.txt
Henry and Sally are a couple.
file4.txt
Alex and Paul are two bachelors.
What I've tried
First, using the fgrep command with the -o and -f options:
$ fgrep -of my_dictionary_file.txt file1.txt
Bill
I can identify that the name Bill can be found in file1.txt.
Second, using the fgrep command with the -r, -l, and -f options:
$ fgrep -rlf names.txt .
./names.txt
./file1.txt
./file4.txt
./file3.txt
./file2.txt
I can search through all of the files in the current directory to find out which files contain names from the list in my_dictionary_file.txt.
The sought-after solution
The solution that I am looking for would be along the lines of combining the two attempts above. To be more explicit, I'd like to know that:
Bill belongs to file1.txt
Martha belongs to file2.txt
Henry and Sally belong to file3.txt
Alex and Paul belong to file4.txt
Any suggestions or pointers towards commands other than fgrep would be greatly appreciated!
Note
The actual problem that I am trying to solve is a scaled up version of this simplified example. I'm hoping to base my answer on responses to this question, so bear in mind that in reality the dictionary file contains hundreds of names and that there are a hundred or more files in the current directory.
Typing
$ fgrep -of my_dictionary_file.txt file1.txt
Bill
$ fgrep -of my_dictionary_file.txt file2.txt
Martha
$ fgrep -of my_dictionary_file.txt file3.txt
Henry Sally
$ fgrep -of my_dictionary_file.txt file4.txt
Alex Paul
does, of course, get me the results, but I'm looking for an efficient method that collects the results for me; perhaps piping the results to a single .txt file.
If you fgrep all the files at once with the -o option, fgrep should print both the file name and the text that matched:
$ fgrep -of dict.txt file*.txt
file1.txt:Bill
file2.txt:Martha
file3.txt:Henry
file3.txt:Sally
file4.txt:Alex
file4.txt:Paul
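If you also want the matches grouped per file in a single report, as hinted at above, one way is to post-process that output with awk; a sketch, where matches.txt is just an illustrative output name:
$ fgrep -of dict.txt file*.txt | awk -F: '{names[$1] = names[$1] " " $2} END {for (f in names) print f ":" names[f]}' | sort > matches.txt
$ cat matches.txt
file1.txt: Bill
file2.txt: Martha
file3.txt: Henry Sally
file4.txt: Alex Paul
The awk step splits each line on the colon grep inserts, accumulates the matched names per file, and prints one line per file; sort keeps the report in a stable order.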
