How to create a txt file with columns being the descending sub-directories in Linux? - linux

My data follow the structure:
../data/study_ID/FF_Number/Exam_Number/date,
where the data dir contains 176 participants' sub-directories. The ID number represents the participant's ID, and each of the following sub-directories represents some experimental number.
I want to create a txt file with one line per participant and the following columns: study_ID, FF_Number, Exam_Number and date.
However, it gets a bit more complicated, as I want to divide the participants into chunks of ~15-20 participants per chunk for the subsequent analysis.
Any suggestions?
Cheers.

Hmm, nobody?
You should redirect the output of the "find" command (consider the -type d and -maxdepth switches) and probably parse it with sed, replacing "/" with spaces. Piping through the "cut" and "column -t" commands may help, and "sort" and "uniq" may also be useful. Do any names, apart from FF and ID, contain spaces or special characters, e.g. related to participants' names?
It should be possible to get the TXT with a one-liner and a few pipes.
You should try, and post the first results of your work on this :)
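For example, a pipeline along those lines might look something like this (a minimal sketch, assuming the ../data/study_ID/FF_Number/Exam_Number/date layout from the question and that none of the names contain spaces; the output file name is just an example):
# list only the date directories (4 levels below ../data),
# strip the ../data/ prefix, turn the remaining "/" into spaces,
# and align the fields into columns
find ../data -mindepth 4 -maxdepth 4 -type d \
  | sed 's|^\.\./data/||; s|/| |g' \
  | sort | uniq \
  | column -t > participants.txt
Each output line is then "study_ID FF_Number Exam_Number date" for one participant, aligned into columns by column -t.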
EDIT: Alright, I created for myself a structure with several thousand directories and subdirectories, numbered by participant, exam number etc., which looks like this (maybe it's not identical to what you have, but don't worry). Studies are numbered from 5 to 150, FF from 45 to 75, and dates from 2012_01_00 to 2012_01_30, which makes a really huge quantity of directories in total.
/Users/pwadas/bzz/data
/Users/pwadas/bzz/data/study_005
/Users/pwadas/bzz/data/study_005/05_Num
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_00
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_01
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_02
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_03
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_04
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_05
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_06
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_07
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_08
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_09
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_10
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_11
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_12
Now, I want (quote) "a txt file with one line per participant and the following columns: study_ID, FF_Number, Exam_Number and date".
So I use the following one-liner:
find /Users/pwadas/bzz/data -type d | head -n 5000 | cut -d'/' -f5-7 | uniq | while read line; do echo -n "$line: " && ls -d /Users/pwadas/bzz/$line/*Exam/* | perl -0pe 's/.*2012/2012/g;s/\n/ /g' && echo ; done > out.txt
and here is the output (the first few lines of out.txt). The lines are very long, so I cut the output to the first 80-90 characters:
dtpwmbp:data pwadas$ cat out.txt |cut -c1-90
data:
data/study_005:
data/study_005/05_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
data/study_005/06_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
data/study_005/07_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
data/study_005/08_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
dtpwmbp:data pwadas$
I hope this will help you a little and you'll be able to modify it according to your needs and patterns, and that seems to be all I can do :) You should analyze the one-liner, especially the "cut" command and the perl regex part, which removes newlines and the full directory name from the "ls" output. This is probably far from optimal, but beautifying is not the point here, I guess :)
So, good luck :)
PS. "head" command limits output for N first lines, you'll probably want to skip out
| head .. |
part.
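As for dividing the participants into chunks of ~15-20 for the later analysis, one possible follow-up (a sketch, assuming out.txt ends up with one participant per line) is to let split do the cutting:
# write chunks of at most 20 participants to chunk_aa, chunk_ab, ...
split -l 20 out.txt chunk_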

Related

Retrieve different information from several files to bring them together in one. BASH

I have a problem with my bash script: I would like to retrieve information contained in several files and gather it in one.
I have a file in this form, which contains about 15000 lines (file1):
1;1;A0200101C
2;2;A0200101C
3;3;A1160101A
4;4;A1160101A
5;5;A1130304G
6;6;A1110110U
7;7;A1110110U
8;8;A1030002V
9;9;A1030002V
10;10;A2120100C
11;11;A2120100C
12;12;A3410071A
13;13;A3400001A
14;14;A3385000G1
15;15;A3365070G1
I need to retrieve the first field of each row, which is the id.
My second file is this; I just need to retrieve its 3rd line (file2):
count
-------
131
(1 row)
I would therefore like to combine the id from (file1) with the 3rd line of (file2) in order to achieve this result:
1;131
2;131
3;131
4;131
5;131
6;131
7;131
8;131
9;131
11;131
12;131
13;131
14;131
15;131
Thank you.
One possible way:
#!/usr/bin/env bash
count=$(awk 'NR == 3 { print $1 }' file2)
while IFS=';' read -r id _; do
printf "%s;%s\n" "$id" "$count"
done < file1
First, read just the third line of file2 and save that in a variable.
Then read each line of file1 in a loop, extracting the first semicolon-separated field, and print it along with that saved value.
Using the same basic approach in a purely awk script instead of shell will be much faster and more efficient. Such a rewrite is left as an exercise for the reader (Hint: In awk, FNR == NR is true when reading the first file given, and false on any later ones. Alternatively, look up how to pass a shell variable to an awk script; there are Q&As here on SO about it.)
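Following that hint, one possible awk-only version (a sketch based on the sample files above) could be:
# read file2 first and remember its 3rd line as "count",
# then print "<first field>;<count>" for every line of file1
awk -F';' 'NR==FNR { if (FNR==3) count=$1; next } { print $1 ";" count }' file2 file1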

Best way to identify similar text inside strings?

I've a list of phrases, actually it's an Excel file, but I can extract each single line if needed.
I need to find the lines that are quite similar to each other; for example, one line can be:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°)
and some lines later I can have the same line, or this one:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA200 (temp.max60°)
As you can see, these two lines are pretty much the same; not equal in this case, but about 98% similar.
The main problem is that I have to process about 45k lines, so I'm looking for a way to do this quickly and maybe visually.
The first thing that came to my mind was to compare the 1st line to the 2nd, then to the 3rd, up to the end, and then do the same with the 2nd line, the 3rd line and so on up to the last-but-one, and produce a kind of score; for example, the 1st line is 100% similar to line 42, 99% to line 522, ..., 21% to line 22142, etc.
But this is only one idea, and maybe not the best one.
Maybe there is already a good program/script/online service out there; I searched but couldn't find one, so in the end I asked here.
Does anyone know a good way (if this is possible), a script, or an online service to achieve this?
One thing you can do is write a script which does the following:
Extract the data from the csv file.
Define a regex which can capture the similarity; a Python example could be:
[\w\s]+\([\w]+\)[\w\s]+\([\w°]+\)
Or something similar; refer to the documentation.
The problem you have is that you are not looking for an exact match, but for a "like" (approximate) match.
This is a problem that even databases have never solved, and it results in a full table scan.
So we're unlikely to solve it.
However, I'd like to propose that you consider alternatives:
You could decide to limit the differences to specific character sets.
In the above example, you ignored the numbers but respected the letters.
If we can assume that this rule will always hold true, then we can perform a text replace on the string.
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°) ==> ANTIBRATING SSPIRING JOINT (type _) mod. GA_ (temp.max_°)
Now, we can deal with this problem by performing an exact string comparison. This can be done by hashing. The easiest way is to feed a hashmap/hashset or a database with a hash index on the column where you will store this adjusted text.
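For instance, a rough shell sketch of that idea (assuming the phrases are exported one per line to a file, here called phrases.txt, and that digits are the only parts allowed to differ):
# collapse every run of digits to "_" so that lines differing only in
# numbers become identical, then list the normalized strings that
# occur more than once
sed -E 's/[0-9]+/_/g' phrases.txt | sort | uniq -d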
You could decide to trade time for space.
For example, you can feed the strings to a service which will build lots of different variations of indexes on your string. For example, feed elasticsearch with your data, and then perform analytic queries on it.
Fuzzy search is the key.
I found several projects and ideas, but the one I used is tre-agrep. I know it's quite old, but in this case it works for me. I created this little script to help me build a list of differences, so I can check it manually against my file:
#!/bin/bash
########## CONFIGURATIONS ##########
original_file=/path/jjj.txt
t_agrep_bin="$(command -v tre-agrep)"
destination_file=/path/destination_file.txt
distance=1
########## CONFIGURATIONS ##########

# number of lines in the input file
lines=$(grep -c "" "$original_file")

# start from a clean output file
if [[ -s "$destination_file" ]]; then
    rm -f "$destination_file"
fi

start=1
while IFS= read -r line; do
    echo "Checking line $start/$lines"
    # approximate-match the current line against the whole file,
    # allowing $distance differences, and collect the matching line numbers
    lista=$("$t_agrep_bin" -"$distance" -B --colour -s -n -i "$line" "$original_file")
    echo "$lista" | awk -F ':' '{print $1}' ORS=' ' >> "$destination_file"
    echo >> "$destination_file"
    start=$((start+1))
done < "$original_file"

Compare two files and extract the values they have in common into a third file

Basically, I need to compare file A, which has 100 records (all of them numerical), with file B, which also has numerical records. The idea is to compare both files and generate a third file that gives me, as output, the numbers that appear in both file A and file B. That is, when I compare A with B, I want to generate a file C with the numbers that B has in common with A.
Example File A:
334030004141665
334030227891112
334030870429938
334030870429939
334030241924239
334030870429932
334030870429933
334030870429930
334030870429931
334030870429936
334030013091432
334030030028092
334030218459802
334030003074203
334030010435534
334030870429937
334030870429934
334030870429935
334030062679707
334030062679706
Example File B
334030013091432
334030030028092
334030218459802
334030003074203
334030010435534
334030010781511
334030010783039
334030204710123
334030203456292
334030203292057
334030010807268
334030010455298
334030240658153
334030218450890
334030023035316
334030010807456
334030010457538
334030071689268
334030204710136
Expected File C
334030013091432
334030030028092
334030218459802
334030003074203
334030010435534
I have already tried comm, diff and grep, but nothing works for me. Ideally I would not have to sort the files; the ones I want to compare now only have 100 records, but the next ones will have more than a million records.
Thank you for your contributions.
I'm going to look through my fingers at "I have already tried with comm, diff, grep" this time, but next time please post some actual attempts.
To extract the information common to both files, the obvious approach would be to use grep:
$ grep -f A B
Output:
334030013091432
334030030028092
334030218459802
334030003074203
334030010435534
but grep in that form would accept partial matches as well, so being lazy I wouldn't read man grep (well, I did, it's grep -w -f A B) but would use awk instead:
$ awk 'NR==FNR{a[$0];next}($0 in a)' A B
Explained:
$ awk '
NR==FNR {       # process the first file in the list
    a[$0]       # hash the record into hash a
    next        # move to the next record in the first file
}               # after this point, process all the files after the first
($0 in a)       # if the record is found in hash a, output it
' A B           # put the smaller file first as it is stored in memory
Once you get to the million-line files, please time both solutions (time grep ... and time awk ...) and post the difference in the comments.
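For example (the redirects just throw the matches away so only the timing is compared):
# compare wall-clock time of the two approaches on the big files
time grep -w -f A B > /dev/null
time awk 'NR==FNR{a[$0];next}($0 in a)' A B > /dev/null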

Split single record into Multiple records in Unix shell Script

I have a record.
Example:
EMP_ID|EMP_NAME|AGE|SALARAy
123456|XXXXXXXXX|30|10000000
Is there a way I can split the record into multiple records? The example output should be like:
EMP_ID|Attributes
123456|XXXXXXX
123456|30
123456|10000000
I want to split the same record into multiple records. Here the employee id is my unique column, and for the remaining 3 columns I want to run a loop and create 3 records, like EMP_ID|EMP_NAME, EMP_ID|AGE, EMP_ID|SALARY. I may have some more columns as well, but for the sample I have provided 3 columns along with the employee id.
Please help me with any suggestions.
With bash:
record='123456|XXXXXXXXX|30|10000000'
IFS='|' read -ra fields <<<"$record"
for ((i=1; i < "${#fields[@]}"; i++)); do
    printf "%s|%s\n" "${fields[0]}" "${fields[i]}"
done
123456|XXXXXXXXX
123456|30
123456|10000000
For the whole file:
{
    IFS= read -r header   # read and skip the header line
    while IFS='|' read -ra fields; do
        for ((i=1; i < "${#fields[@]}"; i++)); do
            printf "%s|%s\n" "${fields[0]}" "${fields[i]}"
        done
    done
} < filename
Records consisting of lines with fields separated by a special delimiter character such as | can be manipulated with basic Unix command line tools such as awk. For example, with your input records in the file records.txt:
awk -F\| 'NR>1{for(i=2;i<=NF;i++){print $1"|"$(i)}}' records.txt
I recommend reading an awk tutorial and playing around with it. Related command line tools worth learning include grep, sort, wc, uniq, head, tail, and cut. If you regularly do data processing of delimiter-separated files, you will likely need them on a daily basis. As soon as your data format gets more complex (e.g. CSV with the possibility of the delimiter character also appearing inside field values), you need more specific tools; for instance, see this question on CSV tools, or jq for processing JSON. Still, knowledge of the basic Unix command line tools will save you a lot of time.

Using grep for multiple patterns from multiple lines in an output file

I have a data output something like this captured in a file.
List item1
attrib1: someval11
attrib2: someval12
attrib3: someval13
attrib4: someval14
List item2
attrib1: someval21
attrib2: someval12
attrib4: someval24
attrib3: someval23
List item3
attrib1: someval31
attrib2: someval32
attrib3: someval33
attrib4: someval34
I want to extract attrib1, attrib3 and attrib4 from the list of data, but only if attrib2 is someval12.
Note that attrib3 and attrib4 could be in any order after attrib2.
So far I have tried to use grep with the -A and -B options, but I need to specify line numbers, and that is a sort of hardcoding which I don't want to do.
grep -B 1 -A 1 -A 2 "attrib2: someval12" | egrep -w "attrib1|attrib3|attrib4"
Can I use any other option of grep which doesn't involve specifying the before and after occurrences for this example?
Grep and other tools (like join, sort, uniq) work on the principle "one record per line". It is therefore possible to use a 3-step pipe:
Convert each list item to a single line, using sed.
Do the filtering, using grep.
Convert back to the original format, using sed.
First you need to pick a character that is known not to occur in the input and use it as the separator character, for example '|'.
Then, find the sed command for step 1, which transforms the input into the format
List item1|attrib1: someval11|attrib2: someval12|attrib3: someval13|attrib4: someval14|
List item2|attrib1: someval21|attrib2: someval12|attrib4: someval24|attrib3: someval23|
List item3|attrib1: someval31|attrib2: someval32|attrib3: someval33|attrib4: someval34|
Now step 2 is easy.
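To make that concrete, here is one possible shape of the whole pipeline (a sketch only, written against GNU sed; tr is used here instead of sed for step 3, and the file name items.txt is just a placeholder for your captured output):
#!/usr/bin/env bash
input=items.txt   # hypothetical file holding the captured output

# Step 1: join every "List item" block into one '|'-separated line,
# accumulating the attribute lines in sed's hold space.
join_records() {
  sed -n \
    -e '/^List /{x;s/\n/|/g;/./s/$/|/;/./p;d;}' \
    -e 'H' \
    -e '${x;s/\n/|/g;s/$/|/;p;}' \
    "$1"
}

# Step 2: keep only the records whose attrib2 has the wanted value.
# Step 3: split the surviving records back into one attribute per line
#         and keep only the attributes we are interested in.
join_records "$input" \
  | grep -F '|attrib2: someval12|' \
  | tr '|' '\n' \
  | grep -Ew 'attrib1|attrib3|attrib4'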
