How to extract domains from a text file using Ubuntu commands? - linux

I have a file of URLs, in the format as shown below:
com.blendtuts/S
°=
com.blengineering.www/:http
±=
com.blenheimgang.www/le-porsche-museum-en-details/porsche-museum-3
²=
com.blenheimsi
³=
com.blenkov.www/page/media/18/34/376
´=
com.blentwell.www/bookmarks.php/jackroldan/sp
¸=
com.blentwell.www/tags.php/I
The file is huge, around 250 GB.
I was trying to reverse the words in the file and extract only the domains from the text, using Ubuntu terminal commands.
Let me tell you what I have tried:
First I removed the data after “/” using the following command:
~$ ex -sc '%s/\(\/\).*/\1/ | x' newfile.txt > ddm.txt
And the result was:
com.blendtuts/
°=
com.blengineering.www/
±=
com.blenheimgang.www/
²=
com.blenheimsi
³=
com.blenkov.www/
´=
com.blentwell.www/
¸=
com.blentwell.www/
Now I reversed the complete text in the file using the solution from: How to reverse all the words in a file with bash in Ubuntu?
And got the following result:
/blendtuts.com
°= /www.blengineering.com
±= /www.blenheimgang.com
²= blenheimsi.com
³= /www.blenkov.com
µ= /www.blentwell.com
¶= /www.blentwell.com
•= /www.blentwell.com
/www.blentwell.com
But the problem is still not solved. I would like to know how to extract the domains and put them into another file using Ubuntu. As you can see above, what I have is still not the domain: it has a leading slash attached.
If there is another solution to such a problem using any other operating system, do let me know. I prefer to go with Ubuntu.
I would like to extract the domains from the file into another file, in a proper format.
If I can get the unique domains, that would be an excellent solution to my query. For deduplication I am currently using:
$ sort filename.txt | uniq > save_to_file.txt
Hope to hear a solution.
A sample file is available here: Sample File

Please consider the following for domain extraction and reversal:
awk -F '/' '/com\./ {split($1, arr, /\W+/, seps); for (i=length(arr); i>=1; i--){s = s seps[i] arr[i];} print s ; s="";}'
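Note that the four-argument split() used here is a GNU awk (gawk) extension, so this needs gawk rather than mawk. A usage sketch, assuming the URLs live in filename.txt and the result should go to domains.txt:
gawk -F '/' '/com\./ {split($1, arr, /\W+/, seps); for (i=length(arr); i>=1; i--){s = s seps[i] arr[i];} print s ; s="";}' filename.txt > domains.txt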

Remove invalid entries: we are mostly not interested in lines that start with a non-ASCII character and end with the character '='
We are interested in the URL before the first /
Reverse the URL
I have tried the below command on your content, which gives the list of URLs:
cat -v filename.txt | grep -v '^M-.=' | awk -F '/' '{print $1}' | awk -F '.' 'BEGIN{ORS="";}{ for (i=NF; i>0; i--) if ( i == 1 ) { print $i } else { print $i".";} print "\n"; }'
Output
www.blendschutzrollo.com
blendtuts.com
www.blengineering.com
www.blenheimgang.com
.
.
.

I got this answer:
$ perl -F/ -anle 'print reverse(split("([^.]*)", $F[0])) if /\./' file_name.txt
One can refer to: https://askubuntu.com/questions/847307/how-to-do-this-in-a-single-command-on-ubuntu-16-04
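To also get the unique domains the question asks for, the reversed output can be deduplicated with sort -u. For an input in the 250 GB range it helps to point sort at a roomy temporary directory and use a byte-based locale; a sketch, where the buffer size and temp path are assumptions:
perl -F/ -anle 'print reverse(split("([^.]*)", $F[0])) if /\./' file_name.txt | LC_ALL=C sort -u -S 4G -T /path/to/big/tmp > unique_domains.txt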

Related

How to grep string and show previous word in a Linux file

I have a file with a lot of IPs, and each IP has an ID, like this:
"id":340,"ip":"10.38.6.25"
"id":341,"ip":"10.38.6.26"
"id":345,"ip":"10.38.6.27"
"id":346,"ip":"110.38.6.27"
Before and after these IPs the file has more information; it is the output of an API call.
I need to grep an IP and have the command show the id, just the number, like this:
345
EDIT: More information: the IP will be different every time, so I need to pass the IP as an argument. I can't reformat the IP into the syntax X/X/X/X...
Any ideas?
Since your current requirement is to get the IDs from your broken JSON file, I am re-formatting my earlier answer.
Though I do NOT recommend this solution for getting the ID, a hacky way to do it would be to use grep in PCRE mode. The logic is to match the IP string together with the characters just before it; I am not sure how to extract the digits of the id alone, so the grep returns
317,"ip":"10.38.6.2"
So I use process substitution to pick out the value before the first ',' as below.
IFS="," read -r id _< <(grep -Po ".{0,4}\"ip\":\"10.38.6.2\"" file); printf "%s\n" "$id"
317
IFS="," read -r id _< <(grep -Po ".{0,4}\"ip\":\"10.38.6.3\"" file); printf "%s\n" "$id"
318
Just add the IP you need as part of the grep string.
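If the IP arrives as a script argument, one way to build the same grep (a sketch; the dots are escaped so they are not treated as regex wildcards):
ip="$1"                               # e.g. 10.38.6.27
esc=${ip//./\\.}                      # escape the dots for the PCRE pattern
IFS="," read -r id _ < <(grep -Po ".{0,4}\"ip\":\"${esc}\"" file)
printf '%s\n' "$id"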
The logic below applies only to your initial inputs.
Using multiple delimiters : and , in awk, we can do something like:
awk -F'[:,]' '/10\.38\.6\.27/{print $2}' file
345
A better way would be to use match(), the equivalent of awk's // regex feature, so you can use a variable of your choice. Provide the input IP in the following format.
input='"10\\.38\\.6\\.25"'
awk -F'[:,]' -v var="$input" '{ if ( match( $0, var )) {print $2};}' file
340
A more robust way to avoid matching incorrect lines would be to also use " as a delimiter and do a direct string comparison with the IP (passed without the surrounding quotes and escapes, e.g. input='10.38.6.25'), as suggested by hek2mgl.
awk -F'[:,"]' -v var="$input" '$9==var{print $4}' file
340
If you want to look up a single IP, use this:
jq ".collection|.[]|select(.ip==\"10.38.6.3\").id" data.json
If you must pass the IP as an argument, then write a one-liner bash script like this:
jq ".collection|.[]|select(.ip==\"$2\").id" "$1"
And call it like this:
./script data.json 10.38.6.3
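A minimal wrapper script along those lines (a sketch; the name script is just an example) could be:
#!/bin/bash
# usage: ./script data.json 10.38.6.3
# $1 = JSON file, $2 = IP to look up
jq ".collection|.[]|select(.ip==\"$2\").id" "$1"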
grep
grep -Po ':\K\d+(?=,"ip":"xx\.xx\.xx\.xx")' file
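For example, with one of the sample IPs substituted for the xx.xx.xx.xx placeholder (dots escaped), this would print just the id:
grep -Po ':\K\d+(?=,"ip":"10\.38\.6\.27")' file
345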
awk -F, '/10\.38\.6\.25/ {gsub("\"","");split($1,a,":") ;print a[2]}' ip
340
or
awk -F, -v ipin="10.38.6.25" '$0 ~ ipin {gsub("\"","");split($1,a,":") ;print a[2]}' ip
$ awk -F, -v grep="10.38.6.26" '$2 ~ "\"" grep "\"" && sub(/^.*:/,"",$1) {print $1}' foo
341
Grep, sed, and awk are inappropriate tools for JSON parsing. You either need a tool specially designed for working with JSON data (e.g. jq), or you write a script in a language that supports JSON parsing in one way or another (examples: PHP, Perl, JavaScript).
JQ
One of the easiest ways is to use the jq tool (as mentioned in the comments to the question), e.g.:
jq '.collection[] | if .ip == "10.38.6.3" then .id else empty end' < file.json
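If the IP comes from a shell variable or a script argument, jq's --arg is a cleaner way to inject it than string interpolation; a sketch:
ip="10.38.6.3"
jq --arg ip "$ip" '.collection[] | select(.ip == $ip) | .id' file.json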
PHP
Alternatively, you can write a simple tool in PHP, for example. PHP has built-in JSON support.
ip-ids.php
<?php
$ip = trim($argv[1]);
$json = file_get_contents('file.json');
$json = json_decode($json, true);
foreach ($json['collection'] as $e) {
    if ($e['ip'] == $ip)
        echo $e['id'], PHP_EOL;
}
(sanity checks are skipped for the sake of simplicity)
Usage
php ip-ids.php '10.38.6.3'
Node.js
If you have Node installed, the following script can be used as a universal solution. You can pass any IP as the first argument, and the script will output a list of corresponding IDs.
ip-ids.js
#!/usr/bin/node
var fs = require('fs');
var ip = process.argv[2];
var json = fs.readFileSync('file.json', 'utf-8');
json = JSON.parse(json);
for (var i = 0; i < json.collection.length; i++) {
    if (json.collection[i]['ip'] === ip)
        console.log(json.collection[i]['id']);
}
Usage
node ip-ids.js '10.38.6.3'
or, if the executable permissions are set (chmod +x ip-ids.js):
./ip-ids.js '10.38.6.3'
Note, I have skipped sanity checks in the script for the sake of simplicity.
Conclusion
Now you can see that it is pretty easy to use jq. Scripting solutions are slightly more verbose, but not too difficult either. Both approaches are flexible: you don't have to rely on the positions of substrings in the JSON string, or resort to hacks that you will most likely forget after a couple of weeks. The script solutions are reliable and readable (and thus easily maintainable), as opposed to tricky awk/grep/sed expressions.
Original answer
This is the original answer for the case of a file in the following format (I didn't know that the input was JSON). Still, this solution seems to work even with the partial JSON you have currently pasted into the question.
"id":340,"ip":"10.38.6.25"
"id":341,"ip":"10.38.6.26"
"id":345,"ip":"10.38.6.27"
Perl version:
perl -ne '/"id":(\d+).*"ip":"10\.38\.6\.27"/ and print "$1\n"' file
Your example is not valid JSON. In order to get valid JSON you have to add curly braces, which is done by the sed in the following example.
$ sed 's/^/{/;s/$/}/' <<EOF | jq -s 'map(select(.ip == "10.38.6.27")) | map(.id) | .[]'
> "id":340,"ip":"10.38.6.25"
> "id":341,"ip":"10.38.6.26"
> "id":345,"ip":"10.38.6.27"
> "id":346,"ip":"110.38.6.27"
> EOF
345
Normally jq reads just one object. With the option -s jq reads all objects, because you have a list as input. The first map iterates over the list and selects only those objects with the matching attribute ip; this is the same as a grep. The second map takes just the id attribute from the result, and the final .[] is the opposite of the -s option.
If you can pretty-print your JSON and then cat the file, the command below might help:
cat /tmp/file|grep -B 1 "ipaddress"|grep -w id|tr ' ' '\0'|cut -d: -f2|cut -d, -f1

Modification of file names

I have a list of more than 1000 files on the following format.
0521865417_roman_pottery_in_the_archaeological_record_2007.pdf
0521865476_power_politics_and_religion_in_timurid_iran_2007.pdf
0521865514_toward_a_theory_of_human_rights_religion_law_courts_2006.pdf
0521865522_i_was_wrong_the_meanings_of_apologies_2008.pdf
I am on Linux and want to change them as follows
2007_roman_pottery_in_the_archaeological_record.pdf
2007_power_politics_and_religion_in_timurid_iran.pdf
2006_toward_a_theory_of_human_rights_religion_law_courts.pdf
2008_i_was_wrong_the_meanings_of_apologies.pdf
Using rename and awk I managed to get
2007_roman_pottery_in_the_archaeological_record_2007.pdf
2007_power_politics_and_religion_in_timurid_iran_2007.pdf
2006_toward_a_theory_of_human_rights_religion_law_courts_2006.pdf
2008_i_was_wrong_the_meanings_of_apologies_2008.pdf
The remaining task is now to remove the last field that holds the year.
A solution that uses sed to generate the mv commands with the new names and then pipes them to bash:
ls -1 | sed -r 's/^[0-9]+_([A-Za-z_]+)_([0-9]{4})\.pdf$/mv & \2_\1.pdf/' | bash
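If the Perl-based rename utility is available (on Debian/Ubuntu it is often installed as rename or prename), the same transformation can be expressed in one pattern; a sketch under that assumption, with -n as a dry run first:
rename -n 's/^\d+_(.+)_(\d{4})\.pdf$/$2_$1.pdf/' *.pdf   # preview the planned renames
rename 's/^\d+_(.+)_(\d{4})\.pdf$/$2_$1.pdf/' *.pdf      # actually rename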
A workaround from where you left off...
echo 2007_roman_pottery_in_the_archaeological_record_2007.pdf | awk -F '_' 'BEGIN{OFS="_"} {$NF=""; print substr($0, 1, length($0)-1) ".pdf"}'
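To apply that workaround to all of the partially renamed files, a hedged loop might look like this (the echo keeps it as a dry run; drop it once the output looks right):
for f in *_[0-9][0-9][0-9][0-9].pdf; do
  new=$(printf '%s\n' "$f" | awk -F '_' 'BEGIN{OFS="_"} {$NF=""; print substr($0, 1, length($0)-1) ".pdf"}')
  echo mv -- "$f" "$new"
done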

renaming files using loop in unix

I have a situation here.
I have lot of files like below in linux
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaac
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaag
I want to remove the $line and put a counter, from 0001 to 6000 for my 6000 such files, in its place.
Also, I want to remove the trailing 3 characters after this is done for each file.
After the fix the files should look like:
SIPTV_FIPTV_ID0000001_T20141003195717_C0000001000_FWD148_IPV_001.DAT
SIPTV_FIPTV_ID0000002_T20141003195717_C0000001000_FWD148_IPV_001.DAT
Please help.
With some assumptions, I think this should do it:
1. the list of the files is in a file named input.txt, one file per line
2. the code is run in the directory the files are in
3. bash is available
awk '{i++;printf "mv \x27"$0"\x27 ";printf "\x27"substr($0,1,16);printf "%05d", i;print substr($0,22,47)"\x27"}' input.txt | bash
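A pure-bash variant of the same idea (a sketch under the same assumptions; echo keeps it as a dry run):
i=0
while IFS= read -r f; do
  i=$((i + 1))
  printf -v num '%05d' "$i"          # 00001 ... 06000
  new=${f/\$line/$num}               # replace the literal $line with the counter
  new=${new%???}                     # drop the trailing 3 characters
  echo mv -- "$f" "$new"             # remove echo to actually rename
done < input.txt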
from the command prompt give the following command
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%4.4d",++n));
sub("...$","");
print "mv", old, $1}'
%
and check the output; if it looks OK:
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%4.4d",++n));
sub("...$","");
print "mv", old, $1}' | sh
%
A commentary: printf '%s\n' *.DAT??? is meant to feed awk a list of all the filenames that you want to modify, one per line; you may want something more elaborate if the example names you gave aren't representative of the whole spectrum. Regarding the awk script itself, I used sprintf to generate a string with the correct number of zeroes for the replacement of $line. The idiom "\\$..." with two backslashes to quote the dollar sign is required by gawk and does no harm in mawk. As a last remark, in similar cases I prefer to do at least a dry run before passing the commands to the shell.

Shell Scripting - URL manipulation

I need to build a URL from values in a file. This is what I could do:
var=$(grep -A2 -i "some_text" /path/to/file | grep -v "some_text" | cut -d'"' -f 4-5 | cut -d'"' -f 1 | tr -d '\n')
This gives the output: /text/to/be/appended/to/domain
Now, I need to prepend the domain name to the value of var.
So I did,
var1="http://mydomain"
and then
echo ${var1}${var}
So I expect
http://mydomain/text/to/be/appended/to/domain
to be the output, but I am getting just /text/to/be/appended/to/domain.
I guessed it might be due to the / as the first character, but if I use cut to remove the first /, I get only the value of var1 as output.
Where did I go wrong?
Update (not sure if this helps, but still):
If I do echo ${var}${var1}, I get /text/to/be/appended/to/domainhttp://mydomain
Sample entry :
<tr><td><a id="value">some_text</a></td></tr>
<tr><td><a id="value" href="/text/to/be/appended/to/domain">2013</a></td></tr>
This line ending (^M) indicates that at some point the file was edited (or created) in a DOS-like environment. Use "dos2unix yourfile" to fix the problem, on BOTH your script and the file with the sample entries.
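If installing dos2unix is not an option, the carriage returns can also be stripped inline; a sketch built on the pipeline from the question, with \r added to the tr delete set:
var=$(grep -A2 -i "some_text" /path/to/file | grep -v "some_text" | cut -d'"' -f 4-5 | cut -d'"' -f 1 | tr -d '\r\n')
var1="http://mydomain"
echo "${var1}${var}"
# http://mydomain/text/to/be/appended/to/domain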

How to find the particular text stored in the file "data.txt" and it occurs only once

The line I seek is stored in the file data.txt and is the only line of text that occurs only once.
How do I go about finding that particular line using linux?
This is a little bit old, but I think you are looking for this...
cat data.txt | sort | uniq -u
This will show only the values that occur exactly once in the file. I assume you are doing the OverTheWire wargame if you are asking? If so, this is what you are looking for.
To provide some context (I need more rep to comment) this is a question that features in an online "wargame" called Bandit that involves using the command line to discover passwords on an online Linux server to advance up the levels.
For those who would like to see data.txt in full, I've Pastebin'd it here; however, it looks like this:
NN4e37KW2tkIb3dC9ZHyOPdq1FqZwq9h
jpEYciZvDIs6MLPhYoOGWQHNIoQZzE5q
3rpovhi1CyT7RUTunW30goGek5Q5Fu66
JOaWd4uAPii4Jc19AP2McmBNRzBYDAkO
JOaWd4uAPii4Jc19AP2McmBNRzBYDAkO
9WV67QT4uZZK7JHwmOH0jnhurJMwoGZU
a2GjmWtTe3tTM0ARl7TQwraPGXgfkH4f
7yJ8imXc7NNiovDuAl1ZC6xb0O0mMBx1
UsvVyFSfZZWbi6wgC7dAFyFuR6jQQUhR
FcOJhZkHlnwqcD8QbvjRyn886rCrnWZ7
E3ugYDa6Wh2y8C8xQev7vOS8O3OgG1Hw
E3ugYDa6Wh2y8C8xQev7vOS8O3OgG1Hw
ME7nnzbId4W3dajsl6Xtviyl5uhmMenv
J5lN3Qe4s7ktiwvcCj9ZHWrAJcUWEhUq
aouHvjzagN8QT2BCMB6e9rlN4ffqZ0Qq
ZRF5dlSuwuVV9TLhHKvPvRDrQ2L5ODfD
9ZjR3NTHue4YR6n4DgG5e0qMQcJjTaiM
QT8Bw9ofH4x3MeRvYAVbYvV1e1zq3Xim
i6A6TL6nqvjCAPvOdXZWjlYgyvqxmB7k
tx7tQ6kgeJnC446CHbiJY7fyRwrwuhrs
One way to do it is to use:
sort data.txt | uniq -u
The sort command is like cat in that it prints the contents of the file, except that it sorts the lines lexicographically (it reorders them alphabetically, so that matching lines end up next to each other).
The | is a pipe that redirects the output from one command into another.
The uniq command reports or omits repeated lines and by passing it the -u argument we tell it to report only unique lines.
Used together like this, the command will sort data.txt lexicographically by each line, find the unique line and print it back in the terminal for you.
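A tiny demonstration of that behaviour on made-up input:
$ printf 'aaa\nbbb\naaa\n' | sort | uniq -u
bbb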
sort -u data.txt | while read line; do if [ $(grep -c $line data.txt) == 1 ] ;then echo $line; fi; done
was my solution, until I saw the easier one here:
sort data.txt | uniq -u
Add more information to your post.
What does data.txt look like?
Like this:
11111111
11111111
pass1111
11111111
Or like this
afawfdgd
password
somethin
gelse...
And do you know that the password is in the file, or are you searching for a string that doesn't repeat?
If you know the password, use something like this:
cat data.txt | grep 'password'
If you don't know the password and the password is the only unique line in the file, you must create a script.
For example in Python
with open("data.txt", "r") as f:
    for line in f:
        if 'pass' in line:
            print(line)
Of course, replace 'pass' with something else, for example some slice of the line.
And one with only one tool in use, awk:
awk '{a[$1]++}END{for(i in a){if(a[i] == 1){print i} }}' data.txt
sort data.txt | uniq -c | grep 1\ ?*
and it will print only the text that occurs exactly once.
Do not forget to put a space after the backslash.
sort data.txt | uniq -c | grep 1
and you will find the one that occurs only one time.
