how to detect invalid utf8 unicode/binary in a text file - linux

I need to detect a corrupted text file that contains invalid (non-ASCII) UTF-8, Unicode or binary characters.
�>t�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½o��������ï¿ï¿½_��������������������o����������������������￿����ß����������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~�ï¿ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½}���������}w��׿��������������������������������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~������������������������������������_������������������������������������������������������������������������������^����ï¿ï¿½s�����������������������������?�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½}����������ï¿ï¿½ï¿½ï¿½ï¿½y����������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½o�������������������������}��
What I have tried:
iconv -f utf-8 -t utf-8 -c file.csv
This converts the file from UTF-8 to UTF-8, with -c skipping invalid UTF-8 characters. However, those illegal characters still get printed at the end. Are there any other solutions in bash on Linux or in other languages?

Assuming you have your locale set to UTF-8 (see locale output), this works well to recognize invalid UTF-8 sequences:
grep -axv '.*' file.txt
Explanation (from grep man page):
-a, --text: treats the file as text; this is essential, since it prevents grep from aborting once it finds an invalid byte sequence (i.e. one that is not valid UTF-8)
-v, --invert-match: inverts the output, showing the lines that are not matched
-x '.*' (--line-regexp): means to match a complete line consisting entirely of valid UTF-8 characters
Hence, the output consists of the lines that contain invalid (non-UTF-8) byte sequences (because of the inversion with -v).
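A minimal usage sketch (the file name is hypothetical): because only offending lines are printed, the command's exit status can drive a simple validity check.
if grep -qaxv '.*' file.txt; then
    echo "file.txt contains invalid UTF-8 byte sequences"
else
    echo "file.txt is valid UTF-8"
fi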

I would grep for non-ASCII characters.
With GNU grep with PCRE support (needed for -P, which is not always available; on FreeBSD you can use pcregrep from the pcre2 package) you can do:
grep -P "[\x80-\xFF]" file
Reference: How Do I grep For all non-ASCII Characters in UNIX. So, in fact, if you only want to check whether the file contains non-ASCII characters, you can just say:
if grep -qP "[\x80-\xFF]" file ; then echo "file contains non-ASCII characters"; fi
# ^
# silent grep
To remove these characters, you can use:
sed -i.bak 's/[\d128-\d255]//g' file
This will create a file.bak backup file, while the original file has its non-ASCII characters removed. Reference: Remove non-ascii characters from csv.
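To double-check that the cleanup worked (a sketch reusing the same grep pattern and hypothetical file name from above), count the remaining matching lines; it should print 0:
grep -cP "[\x80-\xFF]" file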

Try this, in order to find non-ASCII characters from the shell.
Command:
$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/' utf8.txt
Output:
2 Pour être ou ne pas être
4 Byť či nebyť
5 是或不

What you are looking at is by definition corrupted. Apparently, you are displaying the file as it is rendered in Latin-1; the three characters ï¿½ represent the three byte values 0xEF 0xBF 0xBD. But those are the UTF-8 encoding of the Unicode REPLACEMENT CHARACTER U+FFFD, which is the result of attempting to convert bytes from an unknown or undefined encoding into UTF-8, and which would properly be displayed as � (if you have a browser from this century, you should see something like a black diamond with a question mark in it; but this also depends on the font you are using, etc.).
So your question about "how to detect" this particular phenomenon is easy; the Unicode code point U+FFFD is a dead giveaway, and the only possible symptom from the process you are implying.
These are not "invalid Unicode" or "invalid UTF-8" in the sense that this is a valid UTF-8 sequence which encodes a valid Unicode code point; it's just that the semantics of this particular code point is "this is a replacement character for a character which could not be represented properly", i.e. invalid input.
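A quick way to spot it from the shell (a sketch, assuming bash 4.2+ for the $'\uFFFD' syntax, a UTF-8 locale, and an example file name):
grep -n $'\uFFFD' file.txt    # lists the lines (with numbers) that contain U+FFFD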
As for how to prevent it in the first place, the answer is really simple, but also rather uninformative -- you need to identify when and how the incorrect encoding took place, and fix the process which produced this invalid output.
To just remove the U+FFFD characters, try something like
perl -CSD -pe 's/\x{FFFD}//g' file
but again, the proper solution is to not generate these erroneous outputs in the first place.
To actually answer the question about how to remove only invalid code points, try
iconv -f UTF-8 -t UTF-8//IGNORE broken-utf8.txt >fixed-utf8.txt
(You are not revealing the encoding of your example data. It is possible that it has an additional corruption. If what you are showing us is a copy/paste of the UTF-8 rendering of the data, it has been "double-encoded". In other words, somebody took -- already corrupted, as per the above -- UTF-8 text and told the computer to convert it from Latin-1 to UTF-8. Undoing that is easy; just convert it "back" to Latin-1. What you obtain should then be the original UTF-8 data before the superfluous incorrect conversion.
iconv -f utf-8 -t latin-1 mojibake-utf8.txt >fixed-utf8.txt
See also mojibake.)
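Putting the two steps together (a sketch built from the commands above; the file names are only examples): first undo the double encoding, then strip the remaining replacement characters:
iconv -f utf-8 -t latin-1 mojibake-utf8.txt | perl -CSD -pe 's/\x{FFFD}//g' > fixed-utf8.txt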

This Perl program should remove all non-ASCII characters:
foreach $file (@ARGV) {
    open(IN, $file);
    open(OUT, "> super-temporary-utf8-replacement-file-which-should-never-be-used-EVER");
    while (<IN>) {
        s/[^[:ascii:]]//g;
        print OUT "$_";
    }
    rename "super-temporary-utf8-replacement-file-which-should-never-be-used-EVER", $file;
}
What this does is it takes files as input on the command-line, like so: perl fixutf8.pl foo bar baz
Then, for each line, it replaces each instance of a non-ASCII character with nothing (deletion).
It then writes this modified line out to super-temporary-utf8-replacement-file-which-should-never-be-used-EVER (named this way so it doesn't modify any other files).
Afterwards, it renames the temporary file to that of the original one.
This accepts ALL ASCII characters (including DEL, NUL, CR, etc.), in case you have some special use for them. If you want only printable characters, simply replace :ascii: with :print: in s///.
I hope this helps! Please let me know if this wasn't what you were looking for.

The following C program detects invalid UTF-8 characters.
It was tested and used on a Linux system.
/*
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
#include <stdio.h>
#include <stdlib.h>
void usage( void ) {
    printf( "Usage: test_utf8 file ...\n" );
    return;
}

int line_number = 1;
int char_number = 1;
char *file_name = NULL;

void inv_char( void ) {
    printf( "%s: line : %d - char %d\n", file_name, line_number, char_number );
    return;
}

int main( int argc, char *argv[]) {
    FILE *out = NULL;
    FILE *fh = NULL;

    // printf( "argc: %d\n", argc );
    if( argc < 2 ) {
        usage();
        exit( 1 );
    }
    // printf( "File: %s\n", argv[1] );
    file_name = argv[1];
    fh = fopen( file_name, "rb" );
    if( ! fh ) {
        printf( "Could not open file '%s'\n", file_name );
        exit( 1 );
    }

    int utf8_type = 1;
    int utf8_1 = 0;
    int utf8_2 = 0;
    int utf8_3 = 0;
    int utf8_4 = 0;
    int byte_count = 0;
    int expected_byte_count = 0;

    int cin = fgetc( fh );
    while( ! feof( fh ) ) {
        switch( utf8_type ) {
            case 1:
                if( (cin & 0x80) ) {
                    if( (cin & 0xe0) == 0xc0 ) {
                        utf8_1 = cin;
                        utf8_type = 2;
                        byte_count = 1;
                        expected_byte_count = 2;
                        break;
                    }
                    if( (cin & 0xf0) == 0xe0 ) {
                        utf8_1 = cin;
                        utf8_type = 2;
                        byte_count = 1;
                        expected_byte_count = 3;
                        break;
                    }
                    if( (cin & 0xf8) == 0xf0 ) {
                        utf8_1 = cin;
                        utf8_type = 2;
                        byte_count = 1;
                        expected_byte_count = 4;
                        break;
                    }
                    inv_char();
                    utf8_type = 1;
                    break;
                }
                break;
            case 2:
            case 3:
            case 4:
                // printf( "utf8_type - %d\n", utf8_type );
                // printf( "%c - %02x\n", cin, cin );
                if( (cin & 0xc0) == 0x80 ) {
                    if( utf8_type == expected_byte_count ) {
                        utf8_type = 1;
                        break;
                    }
                    byte_count = utf8_type;
                    utf8_type++;
                    if( utf8_type == 5 ) {
                        utf8_type = 1;
                    }
                    break;
                }
                inv_char();
                utf8_type = 1;
                break;
            default:
                inv_char();
                utf8_type = 1;
                break;
        }
        if( cin == '\n' ) {
            line_number++;
            char_number = 0;
        }
        if( out != NULL ) {
            fputc( cin, out );
        }
        // printf( "lno: %d\n", line_number );
        cin = fgetc( fh );
        char_number++;
    }
    fclose( fh );
    return 0;
}

... I'm trying to detect if a file has corrupted characters. I'm also
interested in deleting them.
This is easy with ugrep and takes just one line:
ugrep -q -e "." -N "\p{Unicode}" file.csv && echo "file is corrupted"
To remove invalid Unicode characters:
ugrep "\p{Unicode}" --format="%o" file.csv
The first command matches any character with -e ".", except valid Unicode matched by -N "\p{Unicode}", which is a "negative pattern" to skip.
The second command matches a Unicode character "\p{Unicode}" and writes it with --format="%o".
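For example, to write the cleaned-up result to a new file rather than to stdout (the same command, just redirected; the file names are examples):
ugrep "\p{Unicode}" --format="%o" file.csv > file-cleaned.csv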

I am probably repeating what others have said already. But I think your invalid characters still get printed because they may be valid. The Universal Character Set is the attempt to reference the world's frequently used characters in order to be able to write robust software which does not rely on a special character set.
So I think your problem may be one of the following two - assuming that your overall goal is to handle this (malicious) input from UTF files in general:
There are invalid UTF-8 characters (better called invalid byte sequences - for this I'd like to refer to the corresponding Wikipedia article).
There are characters with no equivalent in your current display font, which are substituted by a special symbol or shown as their binary ASCII equivalent (e.g. - I therefore would like to refer to the following SO post: UTF-8 special characters don't show up).
So in my opinion you have two possible ways to handle this:
Transform all characters from UTF-8 into something handleable - e.g. ASCII - this can be done e.g. with iconv -f utf-8 -t ascii -o file_in_ascii.txt file_in_utf8.txt (see also the sketch after this list). But be careful: transferring from the wider character space (UTF) into a smaller one might cause data loss.
Handle UTF(-8) correctly - this is how the world writes stuff. If you think you have to rely on ASCII characters because of some limiting post-processing step, stop and rethink. In most cases the post-processor already supports UTF; it's probably better to find out how to utilize it. You're making your stuff future- and bullet-proof.
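A small sketch for the first option (same hypothetical file names as above): the //TRANSLIT suffix asks iconv to approximate characters that have no ASCII equivalent, while //IGNORE silently drops them:
iconv -f utf-8 -t ascii//TRANSLIT -o file_in_ascii.txt file_in_utf8.txt
iconv -f utf-8 -t ascii//IGNORE -o file_in_ascii.txt file_in_utf8.txt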
Handling UTF might seem tricky; the following steps may help you accomplish UTF-readiness:
Be able to display UTF correctly, or ensure that your display stack (OS, terminal and so on) is able to display an adequate subset of Unicode (which, of course, should meet your needs); this may prevent the need for a hex editor in many cases. Unfortunately UTF is too big to come in one font, but a good point to start at is this SO post: https://stackoverflow.com/questions/586503/complete-monospaced-unicode-font
Be able to filter invalid byte sequences. There are many ways to achieve that; this U&L post shows a wide variety of them: Filtering invalid utf8 - I want to especially point out the 4th answer, which suggests using uconv, which allows you to set a callback handler for invalid sequences.
Read a bit more about Unicode.

A very dirty solution in Python 3:
import sys

# errors="replace" keeps the read from failing on byte sequences that are not valid UTF-8
with open("cur.txt", "r", encoding="utf-8", errors="replace") as f:
    for i in f:
        for c in i:
            if ord(c) < 128:
                print(c, end="")
The output should be:
>two_o~}}w~_^s?w}yo}

Using Ubuntu 22.04, I get a more correct answer by using:
grep -axv -P '.*' file.txt
The original answer, without the -P, seems to give false positives for a lot of Asian characters, like:
<lei:LegalName xml:lang="ko">피씨에이생명보험주식회사</lei:LegalName>
<lei:LegalName xml:lang="ko">린드먼 부품소재 전문투자조합 1</lei:LegalName>
<lei:LegalName xml:lang="ko">비엔피파리바 카디프손해보험 주식회사</lei:LegalName>
These characters do pass the scanning of the isutf8 utility.
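For comparison (a sketch, assuming the isutf8 tool from the moreutils package is installed; the file name is an example), isutf8 reports the position of the first invalid byte and exits non-zero when the file is not valid UTF-8:
isutf8 file.txt || echo "file.txt contains invalid UTF-8"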

Related

Processing backspace control character (^H) in real time while logging sdout to file

I am working on a script to test new-to-me hard drives in the background (so I can close the terminal window) and log the outputs. My problem is in getting badblocks to print stdout to the log file so I can monitor its multi-day progress and create properly formatted update emails.
I have been able to print stdout to a log file with the following: (flags are r/w, % monitor, verbose)
sudo badblocks -b 4096 -wsv /dev/sdx 2>&1 | tee sdx.log
Normally the output would look like:
Testing with pattern 0xaa: 2.23% done, 7:00 elapsed. (0/0/0 errors)
No new-line character is used, the ^H control command backs up the cursor, and then the new updated status overwrites the previous status.
Unfortunately, the control character is not processed but saved as a character in the file, producing the above output followed by 43 copies of ^H, the new updated stats, 43 copies of ^H, etc.
Since the output is updated at least once per second, this produces a much larger file than necessary, and makes it difficult to retrieve the current status.
While working in the terminal, the solution cat sdx.log && echo "" prints the expected/wanted results by parsing the control characters (and then inserting a carriage return so it is not immediately printed over by the next terminal line), but using cat sdx.log > some.file or cat sdx.log | mail both still include all of the extra characters (though in email they are interpreted as spaces). This solution (or ones like it which decode or remove the control characters at the time of access) still produces a huge, unnecessary output file.
I have worked my way through the following similar questions, but none have produced (at least that I can figure out) a solution which works in real time with the output to update the file, instead requiring that the saved log file be processed separately after the task has finished writing, or that the log file not be written until the process is done, both of which defeat the stated goal of monitoring progress.
Bash - process backspace control character when redirecting output to file
How to "apply" backspace characters within a text file (ideally in vim)
Thank you!
The main place I've run into this in real life is trying to process man pages. In the past, I've always used a simple script that post-processes by stripping out the backspaces appropriately. One could probably do this sort of thing in 80 characters of Perl, but here's an approach that handles backspace and CR/NL fairly well. I've not tested extensively, but it produces good output for simple cases, e.g.:
$ printf 'xxx\rabclx\bo\rhel\nworld\n' | ./a.out output
hello
world
$ cat output
hello
world
$ xxd output
00000000: 6865 6c6c 6f0a 776f 726c 640a hello.world.
If your output starts to have a lot of CSI sequences, this approach just isn't worth the trouble. cat will produce nice human-consumable output for those cases.
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
FILE * xfopen(const char *path, const char *mode);
off_t xftello(FILE *stream, const char *name);
void xfseeko(FILE *stream, off_t offset, int whence, const char *name);
int
main(int argc, char **argv)
{
    const char *mode = "w";
    char *name = strchr(argv[0], '/');
    off_t last = 0, max = 0, curr = 0;

    name = name ? name + 1 : argv[0];
    if( argc > 1 && ! strcmp(argv[1], "-a")) {
        argv += 1;
        argc -= 1;
        mode = "a";
    }
    if( argc > 1 && ! strcmp(argv[1], "-h")) {
        printf("usage: %s [-a] [-h] file [ file ...]\n", name);
        return EXIT_SUCCESS;
    }
    if( argc < 2 ) {
        fprintf(stderr, "Missing output file. -h for usage\n");
        return EXIT_FAILURE;
    }
    assert( argc > 1 );
    argc -= 1;
    argv += 1;

    FILE *ofp[argc];
    for( int i = 0; i < argc; i++ ) {
        ofp[i] = xfopen(argv[i], mode);
    }

    int c;
    while( ( c = fgetc(stdin) ) != EOF ) {
        fputc(c, stdout);
        for( int i = 0; i < argc; i++ ) {
            if( c == '\b' ) {
                xfseeko(ofp[i], -1, SEEK_CUR, argv[i]);
            } else if( isprint(c) ) {
                fputc(c, ofp[i]);
            } else if( c == '\n' ) {
                xfseeko(ofp[i], max, SEEK_SET, argv[i]);
                fputc(c, ofp[i]);
                last = curr + 1;
            } else if( c == '\r' ) {
                xfseeko(ofp[i], last, SEEK_SET, argv[i]);
            }
        }
        curr = xftello(ofp[0], argv[0]);
        if( curr > max ) {
            max = curr;
        }
    }
    return 0;
}

off_t
xftello(FILE *stream, const char *name)
{
    off_t r = ftello(stream);
    if( r == -1 ) {
        perror(name);
        exit(EXIT_FAILURE);
    }
    return r;
}

void
xfseeko(FILE *stream, off_t offset, int whence, const char *name)
{
    if( fseeko(stream, offset, whence) ) {
        perror(name);
        exit(EXIT_FAILURE);
    }
}

FILE *
xfopen(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    if( fp == NULL ) {
        perror(path);
        exit(EXIT_FAILURE);
    }
    return fp;
}
You can delete the ^H characters:
sudo badblocks -b 4096 -wsv /dev/sdx 2>&1 | tr -d '\b' | tee sdx.log
I have found col -b and colcrt useful, but neither worked perfectly for me. These will apply the control characters, not just drop them:
sudo badblocks -b 4096 -wsv /dev/sdx 2>&1 | col -b | tee sdx.log
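If the log has already been written with the control characters in it, the same tools can clean it up after the fact (a sketch reusing the hypothetical sdx.log name):
col -b < sdx.log > sdx-clean.log          # apply the backspaces, keeping the final rendering
tr -d '\b' < sdx.log > sdx-stripped.log   # or just strip the ^H characters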

how to make a folder named as sth that we read before in a text file by linux terminal?

I have a text file that contains a folder name. I want to read that text file's contents via the Ubuntu terminal and make a folder with that name (which is written in the txt file). I don't know what to do. I have made a file named "in.txt" whose contents are "name", and I have tried the C code below:
#include <stdio.h>
int main(){
FILE *fopen (const char in.txt, const char r+);
}
What should I write in the terminal?
Let the text file in.txt be:
foldername1
foldername2
Then you can create a script file script.sh:
#!/usr/bin/env bash
# $1 will be replaced with the first argument passed to the script:
mkdir -p $(cat "$1")
Run the script in the shell as: ./script.sh in.txt. The folders named in in.txt should now exist. This assumes one folder name per line in in.txt, without spaces.
Let's say you have a text file called in.txt which contains
Rambo
Johnny
Jacky
Hulk
Next you need to read the file word by word; for that you can use fscanf(). Once a word is read, you need to create a folder with that word as its name. The command to create a folder is mkdir. To execute mkdir from the program you can use either system() or execlp().
Here is the simple C program:
#include <stdio.h>
#include <unistd.h>

int main(){
    FILE *fp = fopen ("in.txt", "r");
    if(fp == NULL) {
        /* write some message that file is not present */
        return 0;
    }
    char buf[100];
    while(fscanf(fp,"%s",buf) > 0) {
        /* buf contains each word of file */
        /* now create that folder using execlp */
        if(fork() == 0)
            execlp("/bin/mkdir","mkdir",buf,NULL); /* this will create folder with name available in file */
        else
            ;
    }
    fclose(fp);
    return 0;
}
Note that mkdir fails if directory already exists.
Compile it with gcc -Wall test.c and execute it with ./a.out; it will create the folders.
EDIT: Similarly, if you want to do it using the open() system call: there is no system call that reads word by word, so you need to do the string manipulation yourself.
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int inputfd = open("in.txt",O_RDONLY);
    if (inputfd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }
    /* first find the size of file */
    int pos = lseek(inputfd,0,SEEK_END);
    printf("pos = %d \n",pos);
    /* again make inputfd point to the beginning */
    lseek(inputfd,0,SEEK_SET);
    /* allocate memory equal to size of file (plus a terminating '\0'), not a random 1024 bytes */
    char *buf = malloc(pos + 1);
    read(inputfd,buf,pos); /* read will read the whole file data at a time */
    buf[pos] = '\0';
    /* you need to find the words in buf, because buf contains the whole data, not one word */
    char cmd[50]; /* buffer to store folder name */
    for(int row = 0, index = 0; buf[row]; row++) {
        if(buf[row] != ' ' && buf[row] != '\n') {
            cmd[index] = buf[row];
            index++;
            continue;
        }
        else {
            cmd[index] = '\0';
            index = 0; /* for next word, it should start from cmd[0] */
            if(fork() == 0)
                execlp("/bin/mkdir","mkdir",cmd,NULL);
            else ;
        }
    }
    free(buf);
    close(inputfd);
    return 0;
}
Another option, if you want to perform other commands besides mkdir:
while read -r line; do mkdir -p "$line" && chown www-data:www-data "$line"; done < file.txt
You can use a for loop from the Linux terminal like:
for line in $(cat in.txt); do mkdir $line; done
That should work fine to create multiple folders.

How to merge two lines in a CSV file

I have a CSV file that has the following issue.
line 1: "x","y","z","
line 2:
line 3: "," ",": "
This is a single line but has been written as multiple lines. Unfortunately, I cannot ask the provider to fix it. Is there a way we can fix it?
Read the file. Go through every line and append your output to a variable. Then you have all the lines one after another.
$output = '';
foreach($lines as $line) {
    $output .= $line;
}
To read the CSV file you can use fgetcsv. If "line 1:" is part of your CSV file, then split your output on the ":" and use the second part.
A small C program that counts the " characters and suppresses the \n if this count is odd. I doubt this could be done in sed in fewer lines ...
/***
  Usage ./a.out <the_file.csv > the_file.out
**/
#include <stdio.h>

int main(void)
{
    int ch, state;

    for(state=0; ;) {
        ch = getc(stdin);
        if (ch == EOF) break;
        if (ch == '"') state++;
        if (ch == '\n') { if (state %2) continue; else state=0; }
        putc(ch, stdout);
    }
    if (state %2) putc('"', stdout);   /* close an unterminated quote at EOF */
    return 0;
}
Note: this will of course fail if there are escaped quotes (\") in the input file.

How to tell apart numeric scalars and string scalars in Perl?

Perl usually converts numeric to string values and vice versa transparently. Yet there must be something which allows e.g. Data::Dumper to discriminate between both, as in this example:
use Data::Dumper;
print Dumper('1', 1);
# output:
$VAR1 = '1';
$VAR2 = 1;
Is there a Perl function which allows me to discriminate in a similar way whether a scalar's value is stored as number or as string?
A scalar has a number of different fields. When using Perl 5.8 or higher, Data::Dumper checks whether there's anything in the IV (integer value) field. Specifically, it uses something similar to the following:
use B qw( svref_2object SVf_IOK );

sub create_data_dumper_literal {
    my ($x) = @_;  # This copying is important as it "resolves" magic.
    return "undef" if !defined($x);

    my $sv = svref_2object(\$x);
    my $iok = $sv->FLAGS & SVf_IOK;
    return "$x" if $iok;

    $x =~ s/(['\\])/\\$1/g;
    return "'$x'";
}
Checks:
Signed integer (IV): ($sv->FLAGS & SVf_IOK) && !($sv->FLAGS & SVf_IVisUV)
Unsigned integer (IV): ($sv->FLAGS & SVf_IOK) && ($sv->FLAGS & SVf_IVisUV)
Floating-point number (NV): $sv->FLAGS & SVf_NOK
Downgraded string (PV): ($sv->FLAGS & SVf_POK) && !($sv->FLAGS & SVf_UTF8)
Upgraded string (PV): ($sv->FLAGS & SVf_POK) && ($sv->FLAGS & SVf_UTF8)
You could use similar tricks. But keep in mind,
It'll be very hard to stringify floating point numbers without loss.
You need to properly escape certain bytes (e.g. NUL) in string literals.
A scalar can have more than one value stored in it. For example, !!0 contains a string (the empty string), a floating point number (0) and a signed integer (0). As you can see, the different values aren't even always equivalent. For a more dramatic example, check out the following:
$ perl -E'open($fh, "non-existent"); say for 0+$!, "".$!;'
2
No such file or directory
It is more complicated. Perl changes the internal representation of a variable depending on the context the variable is used in:
perl -MDevel::Peek -e '
$x = 1; print Dump $x;
$x eq "a"; print Dump $x;
$x .= q(); print Dump $x;
'
SV = IV(0x794c68) at 0x794c78
REFCNT = 1
FLAGS = (IOK,pIOK)
IV = 1
SV = PVIV(0x7800b8) at 0x794c78
REFCNT = 1
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 1
PV = 0x785320 "1"\0
CUR = 1
LEN = 16
SV = PVIV(0x7800b8) at 0x794c78
REFCNT = 1
FLAGS = (POK,pPOK)
IV = 1
PV = 0x785320 "1"\0
CUR = 1
LEN = 16
There's no way to find this out using pure Perl. Data::Dumper uses its C (XS) implementation to achieve it. If forced to use pure Perl, it doesn't discriminate strings from numbers when they look like decimal numbers.
use Data::Dumper;
$Data::Dumper::Useperl = 1;
print Dumper(['1',1])."\n";
#output
$VAR1 = [
1,
1
];
Based on your comment that this is to determine whether quoting is needed for an SQL statement, I would say that the correct solution is to use placeholders, which are described in the DBI documentation.
As a rule, you should not interpolate variables directly in your query string.
One simple solution that wasn't mentioned is Scalar::Util's looks_like_number. Scalar::Util has been a core module since 5.7.3, and looks_like_number uses the perlapi to determine if the scalar is numeric.
The autobox::universal module, which comes with autobox, provides a type function which can be used for this purpose:
use autobox::universal qw(type);
say type("42"); # STRING
say type(42); # INTEGER
say type(42.0); # FLOAT
say type(undef); # UNDEF
When a variable is used as a number, that causes the variable to be presumed numeric in subsequent contexts. However, the reverse isn't exactly true, as this example shows:
use Data::Dumper;
my $foo = '1';
print Dumper $foo; #character
my $bar = $foo + 0;
print Dumper $foo; #numeric
$bar = $foo . ' ';
print Dumper $foo; #still numeric!
$foo = $foo . '';
print Dumper $foo; #character
One might expect the third operation to put $foo back in a string context (reversing $foo + 0), but it does not.
If you want to check whether something is a number, the standard way is to use a regex. What you check for varies based on what kind of number you want:
if ($foo =~ /^\d+$/) { print "positive integer" }
if ($foo =~ /^-?\d+$/) { print "integer" }
if ($foo =~ /^\d+\.\d+$/) { print "Decimal" }
And so on.
It is not generally useful to check how something is stored internally--you typically don't need to worry about this. However, if you want to duplicate what Dumper is doing here, that's no problem:
if ((Dumper $foo) =~ /'/) {print "character";}
If the output of Dumper contains a single quote, that means it is showing a variable that is represented in string form.
You might want to try Params::Util::_NUMBER:
use Params::Util qw<_NUMBER>;
unless ( _NUMBER( $scalar ) or $scalar =~ /^'.*'$/ ) {
    $scalar =~ s/'/''/g;
    $scalar = "'$scalar'";
}
The following function returns true (1) if the input is numeric and false ("") if it is a string. The function also returns true (-1) if the input is a numeric Inf or NaN. Similar code can be found in the JSON::PP module.
sub is_numeric {
    my $value = shift;
    no warnings 'numeric';
    # string & "" -> ""
    # number & "" -> 0 (with warning)
    # nan and inf can detect as numbers, so check with * 0
    return unless length((my $dummy = "") & $value);
    return unless 0 + $value eq $value;
    return 1 if $value * 0 == 0; # finite number
    return -1;                   # inf or nan
}
I don't think there is a Perl function to find the type of a value. You can find the type of a data structure (scalar, array, hash), but to find the type of a value you can use a regex.

Is there a command to write random garbage bytes into a file?

I am now doing some tests of my application against corrupted files. But I found it is hard to find test files.
So I'm wondering whether there are existing tools which can write random/garbage bytes into a file of some format.
Basically, I need this tool to:
Write random garbage bytes into the file.
Not require knowledge of the file format; just writing random bytes is OK for me.
Ideally, write at random positions of the target file.
Batch processing is also a bonus.
Thanks.
The /dev/urandom pseudo-device, along with dd, can do this for you:
dd if=/dev/urandom of=newfile bs=1M count=10
This will create a file newfile of size 10M.
The /dev/random device will often block if there is not sufficient randomness built up; urandom will not block. If you're using the randomness for crypto-grade stuff, you may want to steer clear of urandom. For anything else, it should be sufficient and most likely faster.
If you want to corrupt just bits of your file (not the whole file), you can simply use the C-style random functions. Just use rand() to figure out an offset and a length n, then use it n times to grab random bytes to overwrite your file with.
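A rough shell sketch of that idea (assuming GNU coreutils; the file name and byte counts are only examples, and bash's $RANDOM tops out at 32767, so it is not suitable for picking offsets in large files):
size=$(stat -c %s target.bin)          # size of the file to corrupt
offset=$(( RANDOM % (size - 16) ))     # random position, leaving room for 16 bytes
dd if=/dev/urandom of=target.bin bs=1 count=16 seek="$offset" conv=notrunc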
The following Perl script shows how this can be done (without having to worry about compiling C code):
use strict;
use warnings;

sub corrupt ($$$$) {
    # Get parameters, names should be self-explanatory.
    my $filespec = shift;
    my $mincount = shift;
    my $maxcount = shift;
    my $charset = shift;

    # Work out position and size of corruption.
    my @fstat = stat ($filespec);
    my $size = $fstat[7];
    my $count = $mincount + int (rand ($maxcount + 1 - $mincount));
    my $pos = 0;
    if ($count >= $size) {
        $count = $size;
    } else {
        $pos = int (rand ($size - $count));
    }

    # Output for debugging purposes.
    my $last = $pos + $count - 1;
    print "'$filespec', $size bytes, corrupting $pos through $last\n";

    # Open file, seek to position, corrupt and close.
    open (my $fh, "+<$filespec") || die "Can't open $filespec: $!";
    seek ($fh, $pos, 0);
    while ($count-- > 0) {
        my $newval = substr ($charset, int (rand (length ($charset))), 1);
        print $fh $newval;
    }
    close ($fh);
}

# Test harness.
system ("echo =========="); #DEBUG
system ("cp base-testfile testfile"); #DEBUG
system ("cat testfile"); #DEBUG
system ("echo =========="); #DEBUG

corrupt ("testfile", 8, 16, "ABCDEFGHIJKLMNOPQRSTUVWXYZ ");

system ("echo =========="); #DEBUG
system ("cat testfile"); #DEBUG
system ("echo =========="); #DEBUG
It consists of the corrupt function that you call with a file name, minimum and maximum corruption size and a character set to draw the corruption from. The bit at the bottom is just unit testing code. Below is some sample output where you can see that a section of the file has been corrupted:
==========
this is a file with nothing in it except for lowercase
letters (and spaces and punctuation and newlines).
that will make it easy to detect corruptions from the
test program since the character range there is from
uppercase a through z.
i have to make it big enough so that the random stuff
will work nicely, which is why i am waffling on a bit.
==========
'testfile', 344 bytes, corrupting 122 through 135
==========
this is a file with nothing in it except for lowercase
letters (and spaces and punctuation and newlines).
that will make iFHCGZF VJ GZDYct corruptions from the
test program since the character range there is from
uppercase a through z.
i have to make it big enough so that the random stuff
will work nicely, which is why i am waffling on a bit.
==========
It's tested at a basic level but you may find there are edge error cases which need to be taken care of. Do with it what you will.
Just for completeness, here's another way to do it:
shred -s 10 - > my-file
Writes 10 random bytes to stdout and redirects it to a file. shred is usually used for destroying (safely overwriting) data, but it can be used to create new random files too.
So if you already have a file that you want to fill with random data, use this:
shred my-existing-file
You could read from /dev/random:
# generate a 50MB file named `random.stuff` filled with random stuff ...
dd if=/dev/random of=random.stuff bs=1000000 count=50
You can specify the size also in a human readable way:
# generate just 2MB ...
dd if=/dev/random of=random.stuff bs=1M count=2
You can also use cat and head. Both are usually installed.
# write 1024 random bytes to my-file-to-override
cat /dev/urandom | head -c 1024 > my-file-to-override
