Copy specific word from a file to another file using shell script - linux

I am new to shell scripting.
My folder structure is in the format below; every folder contains one file named note.json, and I want to extract a specific field, "user", from each note.json. I tried this for a single file and it works, but it shows unnecessary data, and I also need it in loop form (i.e. visiting every folder and doing the same). Can anybody help me out?
my folder structure:
drwxr-xr-x - zeppelin hdfs 0 2020-06-01 16:20 /user/zeppelin/notebook/2FBC2M3K2
drwxr-xr-x - zeppelin hdfs 0 2020-05-20 18:01 /user/zeppelin/notebook/2FBDEKUGP
drwxr-xr-x - zeppelin hdfs 0 2020-05-26 20:32 /user/zeppelin/notebook/2FBDXNZRC
drwxr-xr-x - zeppelin hdfs 0 2020-05-26 21:00 /user/zeppelin/notebook/2FBEAGZEE
drwxr-xr-x - zeppelin hdfs 0 2020-05-25 14:18 /user/zeppelin/notebook/2FBGXSHZR
drwxr-xr-x - zeppelin hdfs 0 2020-05-20 14:31 /user/zeppelin/notebook/2FBHCNKJP
drwxr-xr-x - zeppelin hdfs 0 2020-06-02 17:34 /user/zeppelin/notebook/2FBJCZ212
For a single folder I tried the command below:
$ cat note.json | grep "user"
"user": "Ayan.Paul",
"data": "org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: HiveAccessControlException Permission denied: user [Ayan.Paul] does not have [USE] privilege on [snt_mmedata_upload_prd]\n\tat org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:300)\n\tat org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:286)\n\tat org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:324)\n\tat org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:265)\n\tat org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)\n\tat org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)\n\tat org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:718)\n\tat org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:801)\n\tat org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:103)\n\tat org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:633)\n\tat org.apache.zeppelin.scheduler.Job.run(Job.java:188)\n\tat org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n\tat java.lang.Thread.run(Thread.java:745)\nCaused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: HiveAccessControlException Permission denied: user [Ayan.Paul] does not have [USE] privilege on [snt_mmedata_upload_prd]\n\tat org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)\n\tat org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)\n\tat org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:262)\n\tat org.apache.hive.service.cli.operation.Operation.run(Operation.java:247)\n\tat org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:541)\n\tat org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:527)\n\tat org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:315)\n\tat org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:562)\n\tat org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1557)\n\tat org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1542)\n\tat org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)\n\tat org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)\n\tat org.apache.thrift.server.TServlet.doPost(TServlet.java:83)\n\tat org.apache.hive.service.cli.thrift.ThriftHttpServlet.doPost(ThriftHttpServlet.java:208)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:707)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n\tat org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)\n\tat 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:224)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:534)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)\n\tat org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\n\t... 3 more\nCaused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.security.authorization.plugin.HiveAccessControlException:Permission denied: user [Ayan.Paul] does not have [USE] privilege on [snt_mmedata_upload_prd]\n\tat org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizer.checkPrivileges(RangerHiveAuthorizer.java:483)\n\tat org.apache.hadoop.hive.ql.Driver.doAuthorizationV2(Driver.java:1330)\n\tat org.apache.hadoop.hive.ql.Driver.doAuthorization(Driver.java:1094)\n\tat org.apache.hadoop.hive.ql.Driver.compile(Driver.java:705)\n\tat org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1863)\n\tat org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1810)\n\tat org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1805)\n\tat org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:126)\n\

As said above, if the file is JSON-structured, the best and cleanest way is to use jq.
Otherwise, if this line always stays the same, you can try:
grep "\"user\":" note.json | cut -d":" -f2 | sed 's/\"//g' | sed 's/,//g' | sed 's/ //g'
where
grep "\"user\":" - takes the line you wanted
cut -d":" -f2 - takes the second field, using ":" as the separator
sed 's/\"//g' - removes the " characters
sed 's/,//g' - removes the commas
sed 's/ //g' - removes spaces, just in case (you don't have to use it)
If you need a loop for it, let's say:
folder_path='/path/to/myfolder'
new_file_path='/path/to/output.txt'
for file in "${folder_path}"/*
do
    if [[ "$(basename "${file}")" == "note.json" ]]
    then
        grep "\"user\":" "${file}" | cut -d":" -f2 | sed 's/\"//g' | sed 's/,//g' | sed 's/ //g' > "${new_file_path}"
    fi
done

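Since the file is JSON, here is a minimal jq sketch that also covers the loop (assumptions: jq is installed, and each subfolder holds one note.json; the recursive filter finds every "user" field however deeply it nests):

for f in */note.json; do
    # -r prints raw strings; .. walks the whole document recursively
    jq -r '.. | objects | select(has("user")) | .user' "$f"
done
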
If you know that the note.json file always has "user" at the beginning of a line, then you can grep for that. It also sounds like you want the value of the "user" JSON field; try using jq to parse that. Below is the "cheap and dirty" way of stripping out the extra characters. (We'll stick with a loop because you're probably doing other things for each file...)
for file in $(find . -name note.json); do
    grep "^.user" "$file" | cut -c 10- | tr -d '",'
done
If you want help with using jq to parse JSON, just ask a different question showing a "note.json" file and your attempt at parsing it!
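As an aside, a sketch of a variant that tolerates odd directory names, since $(find ...) word-splits its results (the tr call just strips the quotes, commas, and spaces as above):

find . -name note.json -exec grep -h '"user":' {} + | tr -d ' ",'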

Related

Shell Script to extract only file name

I have a file (publishfile.txt) with lines in this format:
drwxrwx---+ h655201 supergroup 0 2019-04-24 09:16 /data/xyz/invisible/se/raw_data/OMEGA
drwxrwx---+ h655201 supergroup 0 2019-04-24 09:16 /data/xyz/invisible/se/raw_data/sample
drwxrwx---+ h655201 supergroup 0 2019-04-24 09:16 /data/xyz/invisible/se/raw_data/sample(1)
I just want to extract the names OMEGA, sample, sample(1). How
can I do that? I have used basename in my code, but it doesn't work in a for loop. Here is my sample code:
for line in $(cat $BASE_PATH/publishfile.txt)
do
FILE_PATH=$(echo "line"| awk '{print $NF}' )
done
FILE_NAME=($basename $FILEPATH)
But this code also doesn't work when used outside the for loop.
awk -F / '{ print $NF }' "$BASE_PATH"/publishfile.txt
This simply says that the delimiter is a slash and we want the last field from each line.
You should basically never run Awk on each input line in a shell while read loop (let alone a for loop); Awk itself does this by default, much faster and better than the shell.
In your code above you have a typo. Your code reads:
FILE_NAME=($basename $FILEPATH)
but it should read
FILE_NAME=$(basename $FILEPATH)
That should work fine in or outside of a loop.
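For completeness, a sketch of the corrected loop (it also fixes echo "line" to echo "$line" and the $FILEPATH / $FILE_PATH mismatch; as the awk answer above notes, running awk once per line is slow, but it works):

while read -r line; do
    FILE_PATH=$(echo "$line" | awk '{print $NF}')
    FILE_NAME=$(basename "$FILE_PATH")
    echo "$FILE_NAME"
done < "$BASE_PATH/publishfile.txt"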
Try this:
cat $BASE_PATH/publishfile.txt | awk '{print $7}' | sed 's/.*\///'
the output will be:
OMEGA
sample
sample(1)
UPDATE: I guess cat x.txt | sed 's/.*\///' will still work, provided all of your file and folder paths contain at least one slash (/).
For the commands used, the manuals are: cat, awk, sed
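A shell-only sketch, for reference (assuming the paths themselves contain no spaces): bash parameter expansion can strip everything up to the last slash without spawning basename, awk, or sed:

while read -r line; do
    path=${line##* }       # keep the last space-separated field (the path)
    echo "${path##*/}"     # keep everything after the last slash
done < "$BASE_PATH/publishfile.txt"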

How to get max directory by its name from HDFS?

Below is the directory structure of my HDFS (Hadoop 2.6.0):
/user/cloudera/output_files/file_date_2016-12-27/outputfile.txt
/user/cloudera/output_files/file_date_2016-12-28/outputfile.txt
/user/cloudera/output_files/file_date_2016-12-29/outputfile.txt
..
I would like to get the max output directory by its name from a parent HDFS directory
OUTPUT_HDFS_DIR=/user/cloudera/output_files
latest_output_dir= hdfs dfs -ls -d $OUTPUT_HDFS_DIR/* | sort -n | tail -1
echo $latest_output_dir       # This line is printing
latest_date_dir=$(basename "$latest_output_dir")
echo $latest_date_dir         # This line is not printing; I get an empty line.
Output of above shell script
[cloudera@client09 scripts]$ bash latest_dir.sh
drwxrwx--- - cloudera cloudera 0 2017-04-19 13:35 /user/cloudera/output_files/file_date_2016-12-29
I am expecting $latest_date_dir to be printed as file_date_2016-12-29,but it is not displaying that.
Could someone help me to fix this issue?
Change following line:
latest_output_dir= hdfs dfs -ls -d $OUTPUT_HDFS_DIR/* | sort -n | tail -1
to:
latest_output_dir=`hdfs dfs -ls -d $OUTPUT_HDFS_DIR/* | sort -n | tail -1`
Explanation: Your command will be executed, but the output won't be assigned to the variable. The change I am suggesting does the missing part (assigns the output to the variable).
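The same fix in the more modern $( ) command-substitution style, which nests more cleanly than backticks:

latest_output_dir=$(hdfs dfs -ls -d "$OUTPUT_HDFS_DIR"/* | sort -n | tail -1)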

grep/sed copies two identical file names in a directory

I am executing the following sed command on Mac OS X El Capitan:
grep -rl 'efefef' . | xargs sed -i ' ' "s/efefef/cccccc/g"
If I run the command, the really strange thing is: whenever grep finds the expression, the command copies the file into the same directory with the SAME filename. How is that possible?!?
-rw-r--r-- 1 craphunter staff 12605 16 Okt 14:40 backend_pay.de.yml
-rw-r--r-- 1 craphunter staff 12694 15 Okt 16:41 backend_pay.de.yml
Now I do have two files with the same FILENAME in the SAME directory?!?!?
Any idea? How is it even possible?!
Thanks!
craphunter
You added a space as the backup file's suffix:
sed -i ' '
so the "duplicate" is really a backup copy named "backend_pay.de.yml " with a trailing space. Use something more distinctive, like ~.
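For instance (a sketch; BSD/macOS sed appends the -i argument to the backup file's name, and an empty suffix means no backup at all):

grep -rl 'efefef' . | xargs sed -i '~' "s/efefef/cccccc/g"    # backups end in ~
grep -rl 'efefef' . | xargs sed -i '' "s/efefef/cccccc/g"     # no backups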

bash tail the newest file in folder without variable

I have a bunch of log files in a folder. When I cd into the folder and look at the files it looks something like this.
$ ls -lhat
-rw-r--r-- 1 root root 5.3K Sep 10 12:22 some_log_c48b72e8.log
-rw-r--r-- 1 root root 5.1M Sep 10 02:51 some_log_cebb6a28.log
-rw-r--r-- 1 root root 1.1K Aug 25 14:21 some_log_edc96130.log
-rw-r--r-- 1 root root 406K Aug 25 14:18 some_log_595c9c50.log
-rw-r--r-- 1 root root 65K Aug 24 16:00 some_log_36d179b3.log
-rw-r--r-- 1 root root 87K Aug 24 13:48 some_log_b29eb255.log
-rw-r--r-- 1 root root 13M Aug 22 11:55 some_log_eae54d84.log
-rw-r--r-- 1 root root 1.8M Aug 12 12:21 some_log_1aef4137.log
I want to look at the most recent messages in the most recent log file. I can now manually copy the name of the most recent log and then perform a tail on it and that will work.
$ tail -n 100 some_log_c48b72e8.log
This does involve manual labor so instead I would like to use bash-fu to do this.
I currently found this way to do it:
filename="$(ls -lat | sed -n 2p | tail -c 30)"; tail -n 100 $filename
It works, but I am bummed out that I need to save data into a variable to do it. Is it possible to do this in bash without saving intermediate results into a variable?
tail -n 100 "$(ls -at | head -n 1)"
You do not need ls to actually print timestamps, you just need to sort by them (ls -t). I added the -a option because it was in your original code, but note that this is not necessary unless your logfiles are "dot files", i.e. starting with a . (which they shouldn't).
Using ls this way saves you from parsing the output with sed and tail -c. (And you should not try to parse the output of ls.) Just pick the first file in the list (head -n 1), which is the newest. Putting it in quotation marks should save you from the more common "problems" like spaces in the filename. (If you have newlines or similar in your filenames, fix your filenames. :-D )
Instead of saving into a variable, you can use command substitution in-place.
A truly ls-free solution:
tail -n 100 < <(
    for f in *; do
        [[ $f -nt $newest ]] && newest=$f
    done
    cat "$newest"
)
There's no need to initialize newest, since any file will be newer than the null file named by the empty string.
It's a bit verbose, but it's guaranteed to work with any legal file name. Save it to a shell function for easier use:
tail_latest () {
    dir=${1:-.}
    size=${2:-100}
    for f in "$dir"/*; do
        [[ $f -nt $newest ]] && newest=$f
    done
    tail -n "$size" "$newest"
}
Some examples:
# Default of 100 lines from newest file in the current directory
tail_latest
# 200 lines from the newest file in another directory
tail_latest /some/log/dir 200
A plug for zsh: glob qualifiers let you sort the results of a glob directly, making it much easier to get the newest file.
tail -n 100 *(om[1,1])
om sorts the results by modification time (newest first). [1,1] limits the range of files matched to the first. (I think Y1 should do the same, but it kept giving me an "unknown file attribute" error.)
Without parsing ls, you'd use stat
tail -n 100 "$(stat -c "%Y %n" * | sort -nk1,1 | tail -1 | cut -d" " -f 2-)"
Will break if your filenames contain newlines.
version 2: newlines are OK
tail -n 100 "$(
stat --printf "%Y:%n\0" * |
sort -z -t: -k1,1nr |
{ IFS=: read -d '' time filename; echo "$filename"; }
)"
You can also try this way:
ls -1t | head -n 1 | xargs tail -c 50
Explanation:
ls -1t -- lists the files one per line, sorted by modification time, newest first.
head -n 1 -- takes the first (newest) file.
tail -c 50 -- shows the last 50 characters of that file.
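A variant sketch that returns the last 100 lines, as in the original question (assuming file names without newlines; xargs -I treats each input line as one argument, so ordinary spaces are fine):

ls -1t | head -n 1 | xargs -I{} tail -n 100 {}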

How to get the latest filename alone in a directory?

I am using
ls -ltr /homedir/mydirectory/work/ |tail -n 1|cut -d ' ' -f 10
But this is a very crude way of getting the desired result, and it's also unreliable.
The output I get on simply executing
ls -ltr /homedir/mydirectory/work/ |tail -n 1
is
-rw-r--r-- 1 user pusers 1764 Apr 1 12:06 firstfile.xml
So here I get the file name.
But if the output of the above command is like
-rw-r--r-- 100 user pusers 1764 Apr 1 12:06 firstfile.xml
the first command fails! Understandably so, since I am cutting the 10th space-separated field, which no longer lines up once the link count is wider.
So how can I refine it?
Why do you use the -l flag for ls if you don't need it? Make ls simply output the filenames if you don't need more information, instead of trying to "parse" its non-uniform output (and torturing the poor text-processing utilities...).
LAST_MODIFIED_FILE=`ls -tr | tail -n 1`
If you really want to achieve this using your method, then use awk instead of cut:
ls -ltr /var/log/ |tail -n 1| awk '{print $9}'
Extending user529758's answer, which can give the result for a particular file name pattern, use the command below with your file name:
ls -tr Filename* | tail -n 1
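For reference, a parse-free sketch using the same -nt trick shown in an earlier answer (it assumes you only need the newest entry's name, not its metadata):

latest=
for f in /homedir/mydirectory/work/*; do
    [[ $f -nt $latest ]] && latest=$f
done
echo "${latest##*/}"    # strip the directory part, like basename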
