How to extract domain from a text file using Ubuntu Command?

I have a file of URLs, in the format as shown below:
The file is very large, around 250 GB.
I am trying to reverse the words in the file and extract only the domains from the text, using Ubuntu terminal commands.
Let me tell you what I have tried:
First I removed the data after “/” using the following command:
~$ ex -sc '%s/\(\/\).*/\1/ | x' newfile.txt > ddm.txt
And the result was:
Now I reversed the complete text in the file using the solution from : How to reverse all the words in a file with bash in Ubuntu?
And got the following result:
°= /
±= /
³= /
µ= /
¶= /
•= /
But the problem is still not solved. I would like to know how to extract the URLs and put them into another file using Ubuntu. As you can see above, the output I have is still not the domain; it has a slash with it.
If there is another solution to such a problem using any other operating system, do let me know. I prefer to go with Ubuntu.
I would like to extract the domains from the file and write them to another file in a proper format.
Getting only the unique domains would be ideal. Otherwise, I can remove duplicates with:
$ sort filename.txt | uniq > save_to_file.txt
Hope to hear a solution.
Please check here is the sample file: Sample File

Please consider the following for domain extraction and reversion:
awk -F '/' '/com\./ {split($1, arr, /\W+/, seps); for (i=length(arr); i>=1; i--){s = s seps[i] arr[i];} print s ; s="";}'

Remove invalid entries. We are not interested in lines that start with a non-ASCII character followed by '='.
We are interested in the part of the URL before the first /.
Reverse the URL.
I tried the command below on your content, and it gives the list of URLs:
cat -v filename.txt | grep -v '^M-.=' | awk -F '/' '{print $1}' | awk -F '.' 'BEGIN{ORS="";}{ for (i=NF; i>0; i--) if ( i == 1 ) { print $i } else { print $i".";} print "\n"; }'
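The three steps above (drop invalid lines, keep the part before the first /, reverse the dot-separated labels) can also be sketched in Python. This is only a minimal sketch, not a tested replacement for the pipeline: it assumes every usable line looks like com.example.www/some/path, and the names restore_host and unique_hosts are made up for illustration.

```python
def restore_host(line):
    """Return the host with its labels reversed, or None for an unusable line.

    Assumption: a usable line looks like "com.example.www/some/path".
    """
    host = line.strip().split("/", 1)[0]   # keep only the text before the first "/"
    labels = host.split(".")
    # Skip lines without a dot or with empty labels (for example "°= /").
    if len(labels) < 2 or not all(labels):
        return None
    return ".".join(reversed(labels))


def unique_hosts(lines):
    """Yield each restored host once, in order of first appearance."""
    seen = set()
    for line in lines:
        host = restore_host(line)
        if host is not None and host not in seen:
            seen.add(host)
            yield host
```

Passing an open file object (opened with errors="replace" so stray bytes do not stop the run) to unique_hosts reads the input line by line. Only the set of distinct hosts is kept in memory, so this should only be workable if the number of unique domains is much smaller than the 250 GB input.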

I have got this answer:
$ perl -F/ -anle 'print reverse(split("([^.]*)", $F[0])) if /\./' file_name.txt
I have a file with a lot of IPs, and each IP has an ID, like this:
Before and after these IPs the file has more information; it is the output of an API call.
I need to grep for an IP and have the command show only the id, just the number. Like this:
EDIT: More information: the IP will be different every time, so I need to pass the IP as an argument. I can't parse the IP to the syntax X/X/X/X...
Any ideas?
Since your current requirement is to get the IDs from your broken JSON file, I am re-formatting my earlier answer.
Though I do NOT recommend this solution to get the ID, a hacky way to do this would be to use grep in PCRE mode. The logic is to match the IP string together with the few characters before it. I am not sure how to extract the digit from id alone which returns me
So I use process substitution to get the value before the first comma, as below.
IFS="," read -r id _< <(grep -Po ".{0,4}\"ip\":\"\"" file); printf "%s\n" "$id"
Just add the IP you need as part of the grep string.
The logic below applies only to your initial input.
Using multiple delimiters, : and , in awk, we can do something like:
awk -F'[:,]' '/10\.38\.6\.27/{print $2}' file
A better way would be to use the match syntax equivalent to the awk // regex feature to use the variables of your choice. Provide the input IP you want in the following format.
awk -F'[:,]' -v var="$input" '{ if ( match( $0, var )) {print $2};}' file
A more robust way to avoid matching incorrect lines would be to use " also as delimiter and do a direct match with the IP as suggested by hek2mgl.
awk -F'[:,"]' -v var="$input" '$9==var{print $4}' file
If you want to look up a single IP, use this:
jq ".collection|.[]|select(.ip==\"\").id" data.json
If you must set IP in an argument, then write a one-liner bash script like this:
jq ".collection|.[]|select(.ip==\"$2\").id" "$1"
And call it like this:
./script data.json
grep -Po ':\K\d+(?=,"ip":"xx\.xx\.xx\.xx")' file
awk -F, '/10\.38\.6\.25/ {gsub("\"","");split($1,a,":") ;print a[2]}' ip
awk -F, -v ipin="" '$0 ~ ipin {gsub("\"","");split($1,a,":") ;print a[2]}' ip
$ awk -F, -v grep="" '$2 ~ "\"" grep "\"" && sub(/^.*:/,"",$1) {print $1}' foo
grep, sed, and awk are inappropriate tools for JSON parsing. You need either a tool specially designed for working with JSON data (e.g. jq), or a script in a language that supports JSON parsing in one way or another (examples: PHP, Perl, JavaScript).
One of the easiest ways is to use the jq tool (as mentioned in the comments to the question), e.g.:
jq '.collection[] | if .ip == "" then .id else empty end' < file.json
Alternatively, you can write a simple tool in PHP, for example. PHP has built-in JSON support.
<?php
// Print the id of every entry whose ip matches the first argument.
$ip = trim($argv[1]);
$json = file_get_contents('file.json');
$json = json_decode($json, true);
foreach ($json['collection'] as $e) {
  if ($e['ip'] == $ip)
    echo $e['id'], PHP_EOL;
}
(sanity checks are skipped for the sake of simplicity)
php ip-ids.php ''
If you have Node installed, the following script can be used as a universal solution. You can pass any IP as the first argument, and the script will output a list of corresponding IDs.
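#!/usr/bin/env node
var fs = require('fs');
var ip = process.argv[2];
var json = fs.readFileSync('file.json', 'utf-8');
json = JSON.parse(json);
// Print the id of every entry whose ip matches the first argument.
for (var i = 0; i < json.collection.length; i++) {
  if (json.collection[i]['ip'] === ip)
    console.log(json.collection[i]['id']);
}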
node ip-ids.js ''
or, if the executable permissions are set (chmod +x ip-ids.js):
./ip-ids.js ''
Note, I have skipped sanity checks in the script for the sake of simplicity.
Now you can see that it is pretty easy to use jq. Scripting solutions are slightly more verbose, but not too difficult either. Both approaches are flexible. You don't have to rely on positions of sub-strings in the JSON string, or to resort to hacks that you will most likely forget after a couple of weeks. The script solutions are reliable and readable (and thus easily maintainable), as opposed to tricky AWK/GREP/SED expressions.
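For comparison, the same lookup can be sketched in Python with its standard json module. This is a minimal sketch under assumptions: the document layout (a top-level collection list whose entries have ip and id keys) is taken from the jq filters above, while the function name ids_for_ip and the sample addresses are hypothetical placeholders.

```python
import json


def ids_for_ip(document, ip):
    """Return the id of every entry in document["collection"] whose ip matches."""
    return [entry["id"] for entry in document["collection"] if entry.get("ip") == ip]


# Hypothetical sample data; the addresses are placeholders, not taken from the question.
sample = json.loads(
    '{"collection": [{"id": 340, "ip": "192.0.2.1"}, {"id": 341, "ip": "192.0.2.2"}]}'
)
print(ids_for_ip(sample, "192.0.2.2"))
```

With a real file, json.load on the opened file would replace the inline sample, and the IP could be read from sys.argv.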
Original answer
This is the original answer for the case of a file in the following format (I didn't know that the input is in JSON format). Still, this solution seems to work even with the partial JSON you currently pasted into the question.
Perl version:
perl -ne '/"id":(\d+).*"ip":"10\.38\.6\.27"/ and print "$1\n"' file
Your example is not valid JSON. In order to get valid JSON you have to add curly braces. This is done by the sed command in the following example.
$ sed 's/^/{/;s/$/}/' <<EOF | jq -s 'map(select(.ip == "")) | map(.id) | .[]'
> "id":340,"ip":""
> "id":341,"ip":""
> "id":345,"ip":""
> "id":346,"ip":""
> EOF
Normally jq processes each input object on its own. With the -s (slurp) option, jq reads all objects into a single array, which is what we need because the input is a list of objects. The first map iterates over the list and selects only those objects whose ip attribute matches; this is the same as a grep. The second map takes just the id attribute from each result, and the final .[] is the opposite of the -s option: it unpacks the array back into separate values.
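The brace-wrapping trick can be illustrated in Python as well. This is a rough sketch that assumes every non-empty line is a fragment of the form "id":340,"ip":"..." as in the example; the function name ids_from_fragments is hypothetical.

```python
import json


def ids_from_fragments(lines, ip):
    """Wrap each '"id":N,"ip":"..."' line in braces, parse it, and keep matching ids."""
    ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Same effect as sed 's/^/{/;s/$/}/': turn the fragment into a JSON object.
        record = json.loads("{" + line + "}")
        if record.get("ip") == ip:
            ids.append(record["id"])
    return ids
```

Collecting the matches into a list plays the role of jq's -s and map(select(...)); iterating over the returned list corresponds to the final .[].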
If you can pretty-print your JSON first, the command below might help:
cat /tmp/file|grep -B 1 "ipaddress"|grep -w id|tr ' ' '\0'|cut -d: -f2|cut -d, -f1

Modification of file names

I have a list of more than 1000 files in the following format.
I am on Linux and want to change them as follows
Using rename and awk I managed to get
The remaining task is now to remove the last field that holds the year.
A solution that uses sed to generate the new names and the rename commands then pipes them to bash:
ls -1 | sed -r 's/[0-9]*_([A-Za-z_]*)_[a-z]{3}_([0-9]{4})\.pdf$/mv & \2_\1.pdf/g' | bash
A workaround from where you left off:
echo 2007_roman_pottery_in_the_archaeological_record_2007.pdf | awk -F '_' '{$NF=""; OFS="_"; print substr($0, 0, length($0)-1)".pdf";}'
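The remaining step, dropping the last field that holds the year, can also be sketched in Python. This is only an illustration: it assumes names shaped like the one in the echo example above (a four-digit year directly before .pdf), and the function name strip_trailing_year is made up.

```python
import re


def strip_trailing_year(name):
    """Remove a final "_YYYY" field that sits directly before the ".pdf" extension."""
    return re.sub(r"_\d{4}(\.pdf)$", r"\1", name)


print(strip_trailing_year("2007_roman_pottery_in_the_archaeological_record_2007.pdf"))
```

Names that do not end in _YYYY.pdf are returned unchanged, so the function can be applied to a whole directory listing and the results inspected before any os.rename call is made.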

renaming files using loop in unix

I have a situation here.
I have a lot of files like the ones below on Linux.
I want to remove the $line and put a counter from 0001 to 6000 in its place for my 6000 files.
I also want to remove the trailing 3 characters of each file name once this is done.
After the fix, the files should look like this:
Please help.
With some assumptions, I think this should do it:
1. list of the files is in a file named input.txt, one file per line
2. the code is running in the directory the files are in
3. bash is available
awk '{i++;printf "mv \x27"$0"\x27 ";printf "\x27"substr($0,1,16);printf "%05d", i;print substr($0,22,47)"\x27"}' input.txt | bash
From the command prompt, give the following command:
% echo *.DAT??? | awk '{
print "mv", old, $1}'
and check the output, if it looks OK
% echo *.DAT??? | awk '{
print "mv", old, $1}' | sh
A commentary: echo *.DAT??? is meant to give awk a list of all the file names that you want to modify; you may want something more elaborate if the example names you gave aren't representative of the whole set. Regarding the awk script itself, I used sprintf to generate a string with the correct number of zeroes to replace $line. The idiom "\\$..." with two backslashes to quote the dollar sign is required by gawk and does no harm in mawk. As a last remark, in cases like this I prefer to make at least a dry run before passing the commands to the shell.
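The same dry-run idea can be sketched in Python. The real file names are not shown in the question, so everything here is an assumption: the name REPORT_$line.DAT123 used in the usage line is a hypothetical example, and new_name and plan_renames are made-up helper names. The sketch replaces the literal text $line with a zero-padded counter and drops the last three characters.

```python
def new_name(old, counter, placeholder="$line", width=4, trailer=3):
    """Build the new file name for one file.

    Replaces the literal placeholder with a zero-padded counter and drops
    the last `trailer` characters of the name.
    """
    renamed = old.replace(placeholder, str(counter).zfill(width), 1)
    return renamed[:-trailer] if trailer else renamed


def plan_renames(names):
    """Return (old, new) pairs with counters starting at 1; nothing is renamed here."""
    return [(old, new_name(old, i)) for i, old in enumerate(sorted(names), start=1)]


# Hypothetical file name, for illustration only.
print(plan_renames(["REPORT_$line.DAT123"]))
```

Printing the plan first is the equivalent of the dry run; once the pairs look right, each one can be passed to os.rename.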

Shell Scripting - URL manipulation

I need to manipulate a URL from the values from a file. This is what I could do
var=$(grep -A2 -i "some_text" /path/to/file | grep -v "some_text" | cut -d'"' -f 4-5 | cut -d'"' -f 1 | tr -d '\n')
This will give output : /text/to/be/appended/to/domain
Now, I need to append the domain name to var value.
So I did,
and then
echo ${var1}${var}
So I expect
to be the output. But I am getting just /text/to/be/appended/to/domain.
I guessed it was due to the / as the first character, but if I use cut to remove the first /, I get the value of var1 as the output.
Where did I go wrong?
Update (not sure if this would help even a bit, still) :
If I do echo ${var}${var1}, I get /text/to/be/appended/to/domainhttp://mydomain
Sample entry :
<tr><td><a id="value">some_text</a></td></tr>
<tr><td><a id="value" href="/text/to/be/appended/to/domain">2013</a></td></tr>
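The line ending (^M) indicates that at some point the file was edited or created in a DOS-like environment, so var1 most likely ends with a carriage return. When echo prints that carriage return, the cursor returns to the start of the line and the value of var overwrites the domain. Use "dos2unix yourfile" to fix the problem, on BOTH your script and the sample entries.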
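The effect of a stray carriage return can be reproduced in a short Python sketch. It assumes the domain value picked up a trailing \r from a file with DOS line endings; the function name join_url is hypothetical, and the values are the ones quoted in the question.

```python
def join_url(domain, path):
    """Concatenate domain and path after removing stray carriage returns."""
    return domain.replace("\r", "") + path.replace("\r", "")


broken = "http://mydomain\r"            # value as read from a file with DOS line endings
path = "/text/to/be/appended/to/domain"
print(repr(broken + path))              # repr makes the hidden \r visible
print(join_url(broken, path))
```

On a terminal, printing broken + path directly would show only the path, because the \r moves the cursor back to the start of the line before the path is written.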

How to find the particular text stored in the file "data.txt" and it occurs only once

The line I seek is stored in the file data.txt and is the only line of text that occurs only once.
How do I go about finding that particular line using linux?
This is a little bit old, but I think you are looking for this...
cat data.txt | sort | uniq -u
This will show the lines that occur only once in the file. I assume you are familiar with OverTheWire if you are asking? If so, this is what you are looking for.
To provide some context (I need more rep to comment) this is a question that features in an online "wargame" called Bandit that involves using the command line to discover passwords on an online Linux server to advance up the levels.
For those who would like to see data.txt in full I've Pastebin'd it here however it looks like this:
One way to do it is to use:
sort data.txt | uniq -u
The sort command is like cat in that it displays the contents of the file however it sorts the file lexicographically by lines (it reorders them alphabetically so that matching ones are together).
The | is a pipe that redirects the output from one command into another.
The uniq command reports or omits repeated lines and by passing it the -u argument we tell it to report only unique lines.
Used together like this, the command will sort data.txt lexicographically by each line, find the unique line and print it back in the terminal for you.
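The same idea (count how often each line occurs and keep the ones seen exactly once) can be sketched in Python with collections.Counter. This is a minimal sketch, not a replacement for sort | uniq -u; the function name lines_occurring_once is made up, and it reads the whole file into memory.

```python
from collections import Counter


def lines_occurring_once(lines):
    """Return the lines that appear exactly once, in their original order."""
    stripped = [line.rstrip("\n") for line in lines]
    counts = Counter(stripped)
    return [line for line in stripped if counts[line] == 1]
```

Passing an open file object for data.txt to lines_occurring_once returns a list that should contain the single non-repeated line. Unlike uniq, no sorting is needed, because Counter counts matching lines wherever they appear.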
sort -u data.txt | while IFS= read -r line; do if [ "$(grep -cxF -- "$line" data.txt)" -eq 1 ]; then echo "$line"; fi; done
was my solution, until I saw the easier one here:
sort data.txt | uniq -u
Add more information to your post.
What does data.txt look like?
Like this:
Or like this:
Also, do you know the password that is in the file, or are you searching for a string that is not repeated?
If you know the password, use something like this:
cat data.txt | grep 'password'
If you don't know the password and it is the only unique line in the file, you need to write a script.
For example in Python
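file = open("data.txt", "r")
f = file.readlines()
for line in f:
    # 'pass' is a placeholder for the text you are looking for.
    if 'pass' in line:
        print(line.strip())
file.close()
Of course, replace 'pass' with something else, and print only the part you need, for example a slice of the line.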
And one with only one tool in use, awk:
awk '{a[$1]++}END{for(i in a){if(a[i] == 1){print i} }}' data.txt
sort data.txt | uniq -c | grep 1\ ?*
and it will print the only text that occurs only one time
do not forget to put space after the backslash
sort data.txt | uniq -c | grep 1
You will find the only one that occurs one time.
