grep limited characters - one line - linux

I want to look up a word in multiple files and return only a single line per result, or a limited number of characters (40 to 80, for example), rather than the entire line as grep does by default. Something like this:
grep -sR 'wp-content' .
file_1.sql:3309:blog/wp-content
file_1.sql:3509:blog/wp-content
file_2.sql:309:blog/wp-content
Currently I see the following:
grep -sR 'wp-content' .
file_1.sql:3309:blog/wp-content-Progressively predominate impactful systems without resource-leveling best practices. Uniquely maximize virtual channels and inexpensive results. Uniquely procrastinate multifunctional leadership skills without visionary systems. Continually redefine prospective deliverables without.
file_1.sql:3509:blog/wp-content-Progressively predominate impactful systems without resource-leveling best practices. Uniquely maximize virtual channels and inexpensive results. Uniquely procrastinate multifunctional leadership skills without visionary systems. Continually redefine prospective deliverables without.
file_2.sql:309:blog/wp-content-Progressively predominate impactful systems without resource-leveling best practices. Uniquely maximize virtual channels and inexpensive results. Uniquely procrastinate multifunctional leadership skills without visionary systems. Continually redefine prospective deliverables without.

You could use a combination of grep and cut.
Using your example, I would use:
grep -sRn 'wp-content' .|cut -c -40
grep -sRn 'wp-content' .|cut -c -80
That would give you the first 40 or 80 characters respectively.
edit:
Also, there's a flag in grep that you could use:
-m NUM, --max-count=NUM
Stop reading a file after NUM matching lines.
Combining this with what I previously wrote:
grep -sRnm 1 'wp-content' .|cut -c -40
grep -sRnm 1 'wp-content' .|cut -c -80
That should give you the first match in each file, truncated to the first 40 or 80 characters.

egrep -Rso '.{0,40}wp-content.{0,40}' *.sh
This will not call the Radio-Symphonie-Orchestra; the key flag is -o (only matching).
It prints at most 40 characters before and 40 characters after your pattern. Note the e in egrep: the {0,40} interval syntax needs extended regular expressions.

If you change the regex to '^.*wp-content' you can use egrep -o. For example,
egrep -sRo '^.*wp-content' .
The -o flag makes egrep print only the portion of the line that matches, so matching from the start of the line to wp-content should yield the sample output in your first code block.
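Combining the two ideas above (and using grep -E, since egrep is just shorthand for it), something like this should print at most one truncated match per file; this is only a sketch, so adjust the 40 to taste:
grep -sRnoE -m1 '.{0,40}wp-content.{0,40}' .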

Related

How to get unique lines from a very large file in Linux?

I have a very large data file (255G; 3,192,563,934 lines). Unfortunately I only have 204G of free space on the device (and no other devices I can use). I did a random sample and found that in a given, say, 100K lines, there are about 10K unique lines... but the file isn't sorted.
Normally I would use, say:
pv myfile.data | sort | uniq > myfile.data.uniq
and just let it run for a day or so. That won't work in this case because I don't have enough space left on the device for the temporary files.
I was thinking I could use split, perhaps, and do a streaming uniq on maybe 500K lines at a time into a new file. Is there a way to do something like that?
I thought I might be able to do something like
tail -100000 myfile.data | sort | uniq >> myfile.uniq && trunc --magicstuff myfile.data
but I couldn't figure out a way to truncate the file properly.
Use sort -u instead of sort | uniq
This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
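For example (a minimal sketch, assuming GNU sort; --compress-program shrinks the temporary files sort still needs, and -T would only help if you did have somewhere else to put them):
pv myfile.data | sort -u --compress-program=gzip > myfile.data.uniq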

Documentation for uinput

I am trying very hard to find documentation for uinput, but the only thing I have found is linux/uinput.h. I have also found some tutorials on the internet, but no documentation at all!
For example, I would like to know what UI_SET_MSCBIT does, but I can't find anything about it.
How do people know how to use uinput?
Well, it takes some investigation effort for such subtle things. From the drivers/input/misc/uinput.c and include/uapi/linux/uinput.h files you can see the bits behind the UI_SET_* definitions, like this:
MSC
REL
LED
etc.
Run the next command in the kernel sources directory:
$ git grep --all-match -e 'MSC' -e 'REL' -e 'LED' -- Documentation/*
or use regular grep, if your kernel tree doesn't have a .git directory:
$ grep -rl MSC Documentation/* | xargs grep -l REL | xargs grep -l LED
You'll get this file: Documentation/input/event-codes.txt, from which you can see:
EV_MSC: Used to describe miscellaneous input data that do not fit into other types.
EV_MSC events are used for input and output events that do not fall under other categories.
A few EV_MSC codes have special meaning:
MSC_TIMESTAMP: Used to report the number of microseconds since the last reset. This event should be coded as an uint32 value, which is allowed to wrap around with no special consequence. It is assumed that the time difference between two consecutive events is reliable on a reasonable time scale (hours). A reset to zero can happen, in which case the time since the last event is unknown. If the device does not provide this information, the driver must not provide it to user space.
I'm afraid this is the best you can find out there for UI_SET_MSCBIT.
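If you only want to see how the UI_SET_MSCBIT ioctl itself is defined, a plain grep over the uapi header mentioned above works too (run from the kernel source root):
grep -n 'UI_SET_MSCBIT' include/uapi/linux/uinput.h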

Random String in linux by system time

I work with Bash. I want to generate a random string based on the system time. The length of the unique string must be between 10 and 30 characters. Can anybody help me?
There are many ways to do this; my favorite uses the urandom device:
burhan@sandbox:~$ tr -cd '[:alnum:]' < /dev/urandom | fold -w30 | head -n1
CCI4zgDQ0SoBfAp9k0XeuISJo9uJMt
tr (translate) makes sure that only alphanumerics are shown
fold wraps it to a 30-character width
head makes sure we get only the first line
To use the current system time (as you have this specific requirement):
burhan@sandbox:~$ date +%s | sha256sum | base64 | head -c30; echo
NDc0NGQxZDQ4MWNiNzBjY2EyNGFlOW
date +%s = this is our date based seed
We run it through a few hashes to get a "random" string
Finally we truncate it to 30 characters
Other ways (including the two I listed above) are available at this page and others if you simply google.
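If the length itself should vary between 10 and 30 characters, the two ideas can be combined; a minimal sketch in Bash (the len variable is just for illustration):
len=$(( RANDOM % 21 + 10 ))                               # pick a length from 10 to 30
tr -cd '[:alnum:]' < /dev/urandom | head -c "$len"; echo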
Maybe you can use uuidgen -t.
Generate a time-based UUID. This method creates a UUID based on the system clock plus the system's ethernet hardware address, if present.
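For example, to squeeze a time-based UUID into the 10-30 character range (a sketch; dropping the hyphens and the head length of 20 are arbitrary choices):
uuidgen -t | tr -d '-' | head -c 20; echo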
I recently put together a script to handle this; the output is a 33-digit md5 checksum, but you can trim it down with sed to between 10 and 30 characters.
E.g. gen_uniq_id.bsh | sed 's/\(.\{20\}\)\(.*$\)/\1/'
The script is fairly robust: it uses the current time to nanosecond precision, /dev/urandom, and mouse-movement data, and it optionally lets you change the collection times for the random and mouse data.
It also has a -s option that accepts an additional string argument, so you can seed the randomness from anything.
https://code.google.com/p/gen-uniq-id/

Finding non SIMILAR lines on Solaris (or Linux) in two files

I'm trying to compare 2 files on a Solaris box and only see the lines that are not similar. I know that I can use the command given below to find lines that are not exact matches, but that isn't good enough for what I'm trying to do.
comm -3 <(sort FILE1.txt | uniq) <(sort FILE2.txt | uniq) > diff.txt
For the purposes of this question I would define similar as having the same characters ~80% of the time, but completely ignoring the locations that differ (since the sections that differ may also differ in length). The locations that differ can be assumed to occur at roughly the same point in the line. In other words, once we find a location that differs we have to figure out when to start comparing again.
I know this is a hard problem to solve and will appreciate any help/ideas.
EDIT:
Example input 1:
Abend for SP EAOJH with account s03284fjw and client pewaj39023eipofja,level.error
Exception Invalid account type requested: 134029830198,level.fatal
Only in file 1
Example input 2:
Exception Invalid account type requested: 1307230,level.fatal
Abend for SP EREOIWS with account 32192038409aoewj and client eowaji30948209,level.error
Example output:
Only in file 1
I am also realizing that it would be ideal if the files were not read into memory all at once since they can be nearly 100 gigs. Perhaps perl would be better than bash because of this need.
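To make the ~80% definition concrete, here is a rough character-bag similarity sketch in awk, called from the shell as shown in the usage comment. It is an illustration only: it slurps FILE2 into memory and compares every pair of lines, so it ignores the 100-gig constraint, and the 0.8 threshold and file names are assumptions:
# similar.awk: print lines of FILE1 that have no ~80% character-overlap match in FILE2
# usage: awk -f similar.awk FILE2.txt FILE1.txt
function sim(a, b,    i, c, cnt, common, la, lb) {
    la = length(a); lb = length(b)
    if (la == 0 || lb == 0) return (la == lb) ? 1 : 0    # treat two empty lines as identical
    common = 0
    for (i = 1; i <= la; i++) cnt[substr(a, i, 1)]++
    for (i = 1; i <= lb; i++) { c = substr(b, i, 1); if (cnt[c] > 0) { cnt[c]--; common++ } }
    return common / (la > lb ? la : lb)                  # fraction of the longer line covered
}
NR == FNR { f2[NR] = $0; next }                          # first file on the command line (FILE2) is slurped
{
    best = 0
    for (j in f2) { s = sim($0, f2[j]); if (s > best) best = s }
    if (best < 0.8) print                                # nothing in FILE2 shares at least ~80% of the characters
}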

diff folders recursively vs. multithreading

I need to compare two directory structures with around one billion files each (directory depth up to 20 levels).
I found the usual diff -r /location/one /location/two slow.
Is there any multithreaded implementation of diff? Or is it doable by combining the shell and diff? If so, how?
Your disk is gonna be the bottleneck.
Unless you are working on tmpfs, you will probably only lose speed. That said:
find -maxdepth 1 -type d -print0 |
xargs -0P4 -n1 -iDIRNAME diff -EwburqN "DIRNAME/" "/tmp/othertree/DIRNAME/"
should do a pretty decent job of comparing trees (in this case . to /tmp/othertree).
It has a flaw right now, in that it won't detect top-level directories in othertree that don't exist in . (the current directory). I leave that as an exercise for the reader - though you could easily repeat the comparison in reverse (a sketch follows below).
The argument -P4 to xargs specifies that you want at most 4 concurrent processes.
Also have a look at the xjobs utility, which does a better job of separating the output. I think with GNU xargs (as shown) you cannot drop diff's -q option, because the parallel diffs would otherwise intermix their output (?).
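To catch those top-level directories that only exist in the second tree, one option is simply to run the pipeline again from the other side; a sketch, assuming the original tree is /location/one (the ORIGTREE variable and -mindepth 1 are my additions, the latter just skips the "." entry itself):
# compare in the other direction so directories missing from the first tree show up too
ORIGTREE=/location/one
cd /tmp/othertree &&
find . -mindepth 1 -maxdepth 1 -type d -print0 |
  xargs -0 -P4 -I DIRNAME diff -EwburqN "DIRNAME/" "$ORIGTREE/DIRNAME/"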
