Finding data in a large binary file and outputting it with context - Linux

Prologue / Context
Last week my root filesystem was remounted read-only several times, and I took a complete snapshot via ddrescue. Sadly, the filesystem was already damaged and some files are missing. At the moment I am trying to find my ejabberd user database, which should be somewhere within the image. TestDisk found the required file (marked as deleted) but could not restore it. Since the file is pretty small and I have a backup from some months ago, I thought about searching the whole raw image for it.
So now I have a 64 GB file with a damaged filesystem and would like to extract some 4 KiB blocks which contain a certain pattern.
Question
How can I find the data within the 64 GB file and extract the result with some context (4 KiB)?
Since the filesystem image resides on my server, I would prefer a Linux CLI tool.

The Tool
Since I couldn't find a tool that met my requirements, I wrote one myself in Go. I call it bima (for binary match). It isn't pretty, but it did the job:
package main

import (
    "bytes"
    "encoding/hex"
    "fmt"
    "gopkg.in/alecthomas/kingpin.v1"
    "io"
    "log"
    "math"
    "os"
)

var (
    debug             = kingpin.Flag("debug", "Enable debug mode.").Short('d').Bool()
    bsize             = kingpin.Flag("blocksize", "Blocksize").Short('b').Default("126976").Int()
    debugDetail       = kingpin.Flag("debugdetail", "Debug Detail").Short('v').Default("10").Int()
    matchCommand      = kingpin.Command("match", "Match a value")
    matchCommandValue = matchCommand.Arg("value", "The value (Hex Encoded e.g.: 616263 == abc)").Required().String()
    matchCommandFile  = matchCommand.Arg("file", "The file").Required().String()
)

func main() {
    kingpin.Version("0.1")
    mode := kingpin.Parse()

    if *bsize <= 0 {
        log.Fatal("The blocksize has to be larger than 0")
    }
    if *debugDetail <= 0 {
        log.Fatal("The Debug Detail has to be larger than 0")
    }

    if mode == "match" {
        searchBytes, err := hex.DecodeString(*matchCommandValue)
        if err != nil {
            log.Fatal(err)
        }
        scanFile(searchBytes, *matchCommandFile)
    }
}

func scanFile(search []byte, path string) {
    searchLength := len(search)
    blocksize := *bsize

    f, err := os.Open(path)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    fi, err := f.Stat()
    if err != nil {
        log.Fatal(err)
    }
    filesize := fi.Size()

    expectedRounds := int(math.Ceil(float64(filesize-int64(searchLength))/float64(blocksize)) + 1)
    if expectedRounds <= 0 {
        expectedRounds = 1
    }

    data := make([]byte, 0, blocksize+searchLength-1)
    data2 := make([]byte, 0, blocksize+searchLength-1)
    offset := make([]byte, searchLength-1)

    // Reading the len of the slice or less (but not the cap).
    readCount, err := f.Read(offset)
    if err == io.EOF {
        fmt.Println("The file seems to be empty")
        return
    } else if err != nil {
        log.Fatal(err)
    }
    data = append(data, offset...)

    buffer := make([]byte, blocksize)
    var blockpos int
    var idx int
    blockpos = 0
    lastLevel := -1
    roundLevel := 0
    idxOffset := 0

    for round := 0; ; round++ {
        if *debug {
            roundLevel = ((round * 100) / expectedRounds)
            if (roundLevel%*debugDetail == 0) && (roundLevel > lastLevel) {
                lastLevel = roundLevel
                fmt.Fprintln(os.Stderr, "Starting round", round+1, "of", expectedRounds, "--", ((round * 100) / expectedRounds))
            }
        }

        // At EOF, the count will be zero and err will be io.EOF.
        readCount, err = f.Read(buffer)
        if err != nil {
            if err == io.EOF {
                if *debug {
                    fmt.Fprintln(os.Stderr, "Done - Found EOF")
                }
                break
            }
            fmt.Println(err)
            return
        }
        data = append(data, buffer[:readCount]...)

        // Report every occurrence within the current window.
        data2 = data
        idxOffset = 0
        for {
            idx = bytes.Index(data2, search)
            if idx >= 0 {
                fmt.Println(blockpos + idxOffset + idx)
                if idx+searchLength < len(data2) {
                    data2 = data2[idx+searchLength:]
                    // Advance past the whole match, not just to its start,
                    // so later hits in the same block report correct offsets.
                    idxOffset += idx + searchLength
                } else {
                    break
                }
            } else {
                break
            }
        }

        // Keep the last searchLength-1 bytes so matches across block
        // boundaries are not missed.
        data = data[readCount:]
        blockpos += readCount
    }
}
The Story
For completeness, here is what I did to solve my problem:
At first I used hexedit to find out that all DB files have the same header. Hex-encoded it looks like this: 0102030463584d0b0000004b62574c41
So I used my tool to find all occurrences within my sda.image file:
./bima match 0102030463584d0b0000004b62574c41 ./sda.image >DBfiles.txt
For the 64 GB this took about 8 minutes, and I think the HDD was the limiting factor.
The result was about 1200 occurrences, which I extracted from the image with dd. As I didn't know the exact size of the files, I simply extracted chunks of 20,000 bytes:
for f in $(cat DBfiles.txt); do
    dd if=sda.image of=$f.dunno bs=1 ibs=1 skip=$f count=20000
done
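The same extraction could also be done with a few lines of Go instead of byte-wise dd; a minimal sketch, assuming the DBfiles.txt offset list and the 20,000-byte chunk size from above:
package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "strconv"
)

func main() {
    img, err := os.Open("sda.image")
    if err != nil {
        log.Fatal(err)
    }
    defer img.Close()

    list, err := os.Open("DBfiles.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer list.Close()

    scanner := bufio.NewScanner(list)
    for scanner.Scan() {
        // Each line of DBfiles.txt is a byte offset printed by bima.
        off, err := strconv.ParseInt(scanner.Text(), 10, 64)
        if err != nil {
            log.Fatal(err)
        }
        chunk := make([]byte, 20000)
        // ReadAt may return io.EOF together with valid data near the end of
        // the image, so only the bytes actually read are written out.
        n, _ := img.ReadAt(chunk, off)
        if err := os.WriteFile(fmt.Sprintf("%d.dunno", off), chunk[:n], 0644); err != nil {
            log.Fatal(err)
        }
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}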
Now I had about 1200 files and had to find the right ones. In a first step I searched for the passwd files (passwd.DCD and passwd.DCL). Later I did the same for the roster files. As the header of the files contains the name, I simply grepped for passwd:
for f in *.dunno; do
    if [ "$(cat $f | head -c 200 | grep "passwd" | wc -l)" == "1" ]; then
        echo "$f" | sed 's/\.$//g' >> passwd_files.list
    fi
done
Because the chunks were larger than the files, I had to find the end of each file manually. I did the corrections with Curses Hexedit.
During that process I could see that the head of each file contained either dcl_logk or dcd_logk. So I knew which of the files were DCL files and which were DCD files.
In the end I had each file up to ten times and had to decide which version I wanted to use. In general I took the largest file. After putting the files in the DB directory of the new ejabberd server and restarting it, all accounts were back again. :-)

Related

Go program slowing down when increasing number of goroutines

I'm doing a small project for my parallelism course, and I have tried it with buffered channels, unbuffered channels, and without channels using pointers to slices, etc. I also tried to optimize it as much as possible (not the current state), but I still get the same result: increasing the number of goroutines (even by 1) slows down the whole program. Can someone please tell me what I'm doing wrong, and whether a parallel speed-up is even possible in this situation?
Here is part of the code:
func main() {
    rand.Seed(time.Now().UnixMicro())

    numAgents := 2
    fmt.Println("Please pick a number of goroutines: ")
    fmt.Scanf("%d", &numAgents)

    numFiles := 4
    fmt.Println("How many files do you want?")
    fmt.Scanf("%d", &numFiles)

    start := time.Now()
    numAssist := numFiles
    channel := make(chan []File, numAgents)
    files := make([]File, 0)

    for i := 0; i < numAgents; i++ {
        if i == numAgents-1 {
            go generateFiles(numAssist, channel)
        } else {
            go generateFiles(numFiles/numAgents, channel)
            numAssist -= numFiles / numAgents
        }
    }

    for i := 0; i < numAgents; i++ {
        files = append(files, <-channel...)
    }

    elapsed := time.Since(start)
    fmt.Printf("Function took %s\n", elapsed)
}

func generateFiles(numFiles int, channel chan []File) {
    magicNumbersMap := getMap()
    files := make([]File, 0)

    for i := 0; i < numFiles; i++ {
        content := randElementFromMap(&magicNumbersMap)
        length := rand.Intn(400) + 100
        hexSlice := getHex()
        for j := 0; j < length; j++ {
            content = content + hexSlice[rand.Intn(len(hexSlice))]
        }
        hash := getSHA1Hash([]byte(content))
        file := File{
            content: content,
            hash:    hash,
        }
        files = append(files, file)
    }

    channel <- files
}
My expectation was that increasing the number of goroutines would make the program run faster, up to a certain number of goroutines, at which point adding more would give the same execution time or slightly worse.
EDIT: All the functions that are used:
import (
    "crypto/sha1"
    "encoding/base64"
    "fmt"
    "math/rand"
    "time"
)

type File struct {
    content string
    hash    string
}

func getMap() map[string]string {
    return map[string]string{
        "D4C3B2A1": "Libcap file format",
        "EDABEEDB": "RedHat Package Manager (RPM) package",
        "4C5A4950": "lzip compressed file",
    }
}

func getHex() []string {
    return []string{
        "0", "1", "2", "3", "4", "5",
        "6", "7", "8", "9", "A", "B",
        "C", "D", "E", "F",
    }
}

func randElementFromMap(m *map[string]string) string {
    x := rand.Intn(len(*m))
    for k := range *m {
        if x == 0 {
            return k
        }
        x--
    }
    return "Error"
}

func getSHA1Hash(content []byte) string {
    h := sha1.New()
    h.Write(content)
    return base64.URLEncoding.EncodeToString(h.Sum(nil))
}
Simply speaking, the file-generation code is not complex enough to justify parallel execution. All the context switching and moving data through the channel eats up all the benefit of parallel processing.
If you add something like time.Sleep(time.Millisecond * 10) inside the loop in your generateFiles function, as if it were doing something more complex, you'll see what you expected to see: more goroutines work faster. But again, only up to a certain level, where the extra work of parallel processing outweighs the benefit.
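For illustration, a sketch of generateFiles with such a stand-in for heavier per-file work (types and helpers as in the question; the 10 ms value is arbitrary):
func generateFiles(numFiles int, channel chan []File) {
    magicNumbersMap := getMap()
    files := make([]File, 0)
    for i := 0; i < numFiles; i++ {
        content := randElementFromMap(&magicNumbersMap)
        length := rand.Intn(400) + 100
        hexSlice := getHex()
        for j := 0; j < length; j++ {
            content = content + hexSlice[rand.Intn(len(hexSlice))]
        }
        time.Sleep(time.Millisecond * 10) // pretend each file costs real work
        hash := getSHA1Hash([]byte(content))
        files = append(files, File{content: content, hash: hash})
    }
    channel <- files
}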
Note also, the execution time of the last bit of your program:
for i := 0; i < numAgents; i++ {
    files = append(files, <-channel...)
}
directly depends on the number of goroutines. Since all goroutines finish at approximately the same time, this loop almost never executes in parallel with your workers, and the time it takes to run is simply added to the total time.
Next, when you append to the files slice multiple times, it has to grow several times and copy the data over to the new location. You can avoid this by initially creating a slice with capacity for all your resulting elements (luckily, you know exactly how many you'll need).
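A sketch of that pre-allocation in the main function above (the element count is known up front, so the appends never re-allocate; the same idea applies to the files slice inside generateFiles):
// In main: give the result slice its full capacity up front.
files := make([]File, 0, numFiles)
for i := 0; i < numAgents; i++ {
    files = append(files, <-channel...) // never has to grow the backing array
}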

Weird MAC address format using "arp -a" tool on MacOS

I want to parse the output of the arp -a command on macOS in Go using the net.ParseMAC function; however, I'm getting an error due to the weird formatting.
Sample output from arp -a command:
> arp -a
? (192.168.1.1) at 0:22:7:4a:21:d5 on en0 ifscope [ethernet]
? (224.0.0.251) at 1:0:5e:0:0:fb on en0 ifscope permanent [ethernet]
? (239.255.255.250) at 1:0:5e:7f:ff:fa on en0 ifscope permanent [ethernet]
The MAC formatting is unexpected: instead of 01 and 00, the MAC addresses contain just 1 and 0. It seems the format is allowed to be A:B:C:D:E:F instead of AA:BB:CC:DD:EE:FF.
How can I make the output in the latter format, so it can be accepted by the net.ParseMAC function?
Edit:
I made a simple Go function to solve the problem of the missing leading zeroes:
// FixMacOSMACNotation fixes the notation of MAC address on macOS.
// For instance: 1:0:5e:7f:ff:fa becomes 01:00:5e:7f:ff:fa
func FixMacOSMACNotation(s string) string {
    var newString string
    split := strings.Split(s, ":")
    for i, s := range split {
        if i != 0 {
            newString += ":"
        }
        if len(s) == 1 {
            newString += "0" + s
        } else {
            newString += s
        }
    }
    return newString
}
Which can then be used in net.ParseMAC successfully.
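A small end-to-end sketch of that combination (the FixMacOSMACNotation helper is repeated from above; the sample address is taken from the arp output):
package main

import (
    "fmt"
    "net"
    "strings"
)

// FixMacOSMACNotation pads each colon-separated group to two digits,
// e.g. 1:0:5e:7f:ff:fa becomes 01:00:5e:7f:ff:fa (same helper as above).
func FixMacOSMACNotation(s string) string {
    var newString string
    split := strings.Split(s, ":")
    for i, s := range split {
        if i != 0 {
            newString += ":"
        }
        if len(s) == 1 {
            newString += "0" + s
        } else {
            newString += s
        }
    }
    return newString
}

func main() {
    raw := "1:0:5e:0:0:fb" // as printed by arp -a on macOS
    hw, err := net.ParseMAC(FixMacOSMACNotation(raw))
    if err != nil {
        panic(err)
    }
    fmt.Println(hw) // 01:00:5e:00:00:fb
}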
This version uses strings.Builder to fix the given input as desired by OP.
A [strings.]Builder is used to efficiently build a string using Write methods. It minimizes memory copying. The zero value is ready to use. Do not copy a non-zero Builder.
package main
import (
"fmt"
"strings"
)
func main() {
inputs := []string{
"1:0:5e:7f:ff:fa",
"1:0:5e:7f:ff:f",
"1:0:5e:7f:ff:",
"1:0:5e::ff:",
}
for _, input := range inputs {
fmt.Println()
fmt.Println("FixMacOSMACNotation ", FixMacOSMACNotation(input))
}
}
// FixMacOSMACNotation fixes the notation of MAC address on macOS.
// For instance: 1:0:5e:7f:ff:fa becomes 01:00:5e:7f:ff:fa
func FixMacOSMACNotation(s string) string {
var e int
var sb strings.Builder
for i := 0; i < len(s); i++ {
r := s[i]
if r == ':' {
for j := e; j < 2; j++ {
sb.WriteString("0")
}
sb.WriteString(s[i-e : i])
sb.WriteString(":")
e = 0
continue
}
e++
}
for j := e; j < 2; j++ {
sb.WriteString("0")
}
sb.WriteString(s[len(s)-e:])
return sb.String()
}

Reading beyond buffer

I have a buffer of size bufferSize from which I read in chunks of blockSize, however, this yields some (to me) unexpected behavior, when the blockSize goes beyond the bufferSize.
I've put the code here:
http://play.golang.org/p/Ra2jicYHPu
Why does the second chunk only give 4 bytes? What's happening here?
I'd expect Read to always give len(byteArray) bytes, and if that goes beyond the buffer, to handle the situation by setting the pointer in the buffer to just after byteArray, returning the rest of the buffer plus whatever lies beyond, up to the new buffer pointer.
Your expectations are not based on any documented behavior of bufio.Reader. If you want "Read to always give the amount of bytes len(byteArray)" you must use io.ReadAtLeast.
package main

import (
    "bufio"
    "fmt"
    "io"
    "strings"
)

const bufSize = 10
const blockSize = 12

func main() {
    s := strings.NewReader("some length test string buffer boom")
    buffer := bufio.NewReaderSize(s, bufSize)

    b := make([]byte, blockSize)
    n, err := io.ReadAtLeast(buffer, b, blockSize)
    if err != nil {
        fmt.Println(err)
    }
    fmt.Printf("First read got %d bytes: %s\n", n, string(b))

    d := make([]byte, blockSize)
    n, err = io.ReadAtLeast(buffer, d, blockSize)
    if err != nil {
        fmt.Println(err)
    }
    fmt.Printf("Second read got %d bytes: %s\n", n, string(d))
}
Output:
First read got 12 bytes: some length
Second read got 12 bytes: test string
1. See the source of bufio.NewReaderSize:
func NewReaderSize(rd io.Reader, size int) *Reader {
    // Is it already a Reader?
    b, ok := rd.(*Reader)
    if ok && len(b.buf) >= size {
        return b
    }
    if size < minReadBufferSize {
        size = minReadBufferSize
    }
    return &Reader{
        buf:          make([]byte, size),
        rd:           rd,
        lastByte:     -1,
        lastRuneSize: -1,
    }
}
strings.NewReader returns a *strings.Reader, so the buffer returned by bufio.NewReaderSize gets a buf of minReadBufferSize (which is 16).
2. See the source of bufio.Read:
func (b *Reader) Read(p []byte) (n int, err error) {
    // …
    copy(p[0:n], b.buf[b.r:])
    b.r += n
    b.lastByte = int(b.buf[b.r-1])
    b.lastRuneSize = -1
    return n, nil
}
The copy source is b.buf[b.r:]; after your first Read, b.r is 12, so the second Read only returns the 4 bytes still left in the buffer.

What is the fastest way to generate a long random string in Go?

Like an [a-zA-Z0-9] string:
na1dopW129T0anN28udaZ
or hexadecimal string:
8c6f78ac23b4a7b8c0182d
By long I mean 2K or more characters.
This does about 200 MB/s on my box. There's obvious room for improvement.
type randomDataMaker struct {
    src rand.Source
}

func (r *randomDataMaker) Read(p []byte) (n int, err error) {
    for i := range p {
        p[i] = byte(r.src.Int63() & 0xff)
    }
    return len(p), nil
}
You'd just use io.CopyN to produce the string you want. Obviously you could adjust the character set on the way in or whatever.
The nice thing about this model is that it's just an io.Reader, so you can use it to make anything.
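A minimal usage sketch of that idea (the randomDataMaker type is repeated from above; here the raw bytes are hex-encoded afterwards, but mapping to another character set inside Read would work the same way):
package main

import (
    "bytes"
    "encoding/hex"
    "fmt"
    "io"
    "math/rand"
)

type randomDataMaker struct {
    src rand.Source
}

func (r *randomDataMaker) Read(p []byte) (n int, err error) {
    for i := range p {
        p[i] = byte(r.src.Int63() & 0xff)
    }
    return len(p), nil
}

func main() {
    maker := &randomDataMaker{rand.NewSource(1028890720402726901)}
    var buf bytes.Buffer
    // Pull 1 KiB of random bytes through the io.Reader interface.
    if _, err := io.CopyN(&buf, maker, 1024); err != nil {
        panic(err)
    }
    fmt.Println(hex.EncodeToString(buf.Bytes())[:64]) // first 64 hex characters
}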
Test is below:
func BenchmarkRandomDataMaker(b *testing.B) {
    randomSrc := randomDataMaker{rand.NewSource(1028890720402726901)}
    for i := 0; i < b.N; i++ {
        b.SetBytes(int64(i))
        _, err := io.CopyN(ioutil.Discard, &randomSrc, int64(i))
        if err != nil {
            b.Fatalf("Error copying at %v: %v", i, err)
        }
    }
}
On one core of my 2.2GHz i7:
BenchmarkRandomDataMaker 50000 246512 ns/op 202.83 MB/s
EDIT
Since I wrote the benchmark, I figured I'd do the obvious improvement (call out to the random source less frequently). With 1/8 the calls to rand, it runs about 4x faster, though it's a bit uglier:
New version:
func (r *randomDataMaker) Read(p []byte) (n int, err error) {
    todo := len(p)
    offset := 0
    for {
        val := int64(r.src.Int63())
        for i := 0; i < 8; i++ {
            p[offset] = byte(val & 0xff)
            todo--
            if todo == 0 {
                return len(p), nil
            }
            offset++
            val >>= 8
        }
    }
    panic("unreachable")
}
New benchmark:
BenchmarkRandomDataMaker 200000 251148 ns/op 796.34 MB/s
EDIT 2
Took out the masking in the cast to byte since it was redundant. Got a good deal faster:
BenchmarkRandomDataMaker 200000 231843 ns/op 862.64 MB/s
(this is so much easier than real work sigh)
EDIT 3
This came up in IRC today, so I released a library. Also, my actual benchmark tool, while useful for relative speed, isn't sufficiently accurate in its reporting.
I created randbo, which you can reuse to produce random streams wherever you may need them.
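A usage sketch, assuming the import path github.com/dustin/randbo and its New constructor returning an io.Reader of random bytes:
package main

import (
    "encoding/hex"
    "fmt"
    "io"

    "github.com/dustin/randbo" // assumed import path
)

func main() {
    r := randbo.New() // an io.Reader producing random bytes
    buf := make([]byte, 1024)
    if _, err := io.ReadFull(r, buf); err != nil {
        panic(err)
    }
    fmt.Println(hex.EncodeToString(buf)[:64]) // first 64 hex characters
}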
You can use the Go package uniuri to generate random strings (or view the source code to see how they're doing it). You'll want to use:
func NewLen(length int) string
NewLen returns a new random string of the provided length, consisting of standard characters.
Or, to specify the set of characters used:
func NewLenChars(length int, chars []byte) string
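A short usage sketch of those two functions (assuming the commonly used import path github.com/dchest/uniuri):
package main

import (
    "fmt"

    "github.com/dchest/uniuri" // assumed import path
)

func main() {
    fmt.Println(uniuri.NewLen(2048))                                  // 2K standard characters
    fmt.Println(uniuri.NewLenChars(2048, []byte("0123456789abcdef"))) // 2K hex characters
}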
This is actually a little biased towards the first 8 characters in the set (since 256 is not a multiple of len(alphanum)), but it will get you most of the way there.
import (
    "crypto/rand"
)

func randString(n int) string {
    const alphanum = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    var bytes = make([]byte, n)
    rand.Read(bytes)
    for i, b := range bytes {
        bytes[i] = alphanum[b%byte(len(alphanum))]
    }
    return string(bytes)
}
If you want to generate a cryptographically secure random string, I recommend you take a look at this page. Here is a helper function that reads n random bytes from your OS's source of randomness and then base64-encodes them. Note that the string length will be bigger than n because of the base64 encoding.
package main

import (
    "crypto/rand"
    "encoding/base64"
    "fmt"
)

func GenerateRandomBytes(n int) ([]byte, error) {
    b := make([]byte, n)
    _, err := rand.Read(b)
    if err != nil {
        return nil, err
    }
    return b, nil
}

func GenerateRandomString(s int) (string, error) {
    b, err := GenerateRandomBytes(s)
    return base64.URLEncoding.EncodeToString(b), err
}

func main() {
    token, _ := GenerateRandomString(32)
    fmt.Println(token)
}
Here is Evan Shaw's answer, re-worked without the bias towards the first 8 characters of the string. Note that it uses lots of expensive big.Int operations, so it probably isn't that quick! The answer is crypto-strong, though.
It uses rand.Int to make an integer of exactly the right size len(alphanum) ** n, then does what is effectively a base conversion into base len(alphanum).
There is almost certainly a better algorithm for this which would involve keeping a much smaller remainder and adding random bytes to it as necessary. This would get rid of the expensive long integer arithmetic.
import (
    "crypto/rand"
    "fmt"
    "math/big"
)

func randString(n int) string {
    const alphanum = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    symbols := big.NewInt(int64(len(alphanum)))
    states := big.NewInt(0)
    states.Exp(symbols, big.NewInt(int64(n)), nil)
    r, err := rand.Int(rand.Reader, states)
    if err != nil {
        panic(err)
    }
    var bytes = make([]byte, n)
    r2 := big.NewInt(0)
    symbol := big.NewInt(0)
    for i := range bytes {
        r2.DivMod(r, symbols, symbol)
        r, r2 = r2, r
        bytes[i] = alphanum[symbol.Int64()]
    }
    return string(bytes)
}

How to get CPU usage

My Go program needs to know the current CPU usage percentage of all system and user processes.
How can I obtain that?
Check out the package http://github.com/c9s/goprocinfo; it does the parsing of /proc for you.
stat, err := linuxproc.ReadStat("/proc/stat")
if err != nil {
    t.Fatal("stat read fail")
}
for _, s := range stat.CPUStats {
    // s.User
    // s.Nice
    // s.System
    // s.Idle
    // s.IOWait
}
I had a similar issue and never found a lightweight implementation. Here is a slimmed-down version of my solution that answers your specific question. I sample the /proc/stat file just like tylerl recommends. You'll notice that I wait 3 seconds between samples to match top's output, but I have also had good results with 1 or 2 seconds. I run similar code in a loop within a goroutine, then access the CPU usage when I need it from other goroutines.
You can also parse the output of top -n1 | grep -i cpu to get the CPU usage, but it only samples for half a second on my Linux box, and it was way off during heavy load. Regular top seemed to match very closely when I synchronized it with the following program:
package main

import (
    "fmt"
    "io/ioutil"
    "strconv"
    "strings"
    "time"
)

func getCPUSample() (idle, total uint64) {
    contents, err := ioutil.ReadFile("/proc/stat")
    if err != nil {
        return
    }
    lines := strings.Split(string(contents), "\n")
    for _, line := range lines {
        fields := strings.Fields(line)
        if fields[0] == "cpu" {
            numFields := len(fields)
            for i := 1; i < numFields; i++ {
                val, err := strconv.ParseUint(fields[i], 10, 64)
                if err != nil {
                    fmt.Println("Error: ", i, fields[i], err)
                }
                total += val // tally up all the numbers to get total ticks
                if i == 4 {  // idle is the 5th field in the cpu line
                    idle = val
                }
            }
            return
        }
    }
    return
}

func main() {
    idle0, total0 := getCPUSample()
    time.Sleep(3 * time.Second)
    idle1, total1 := getCPUSample()

    idleTicks := float64(idle1 - idle0)
    totalTicks := float64(total1 - total0)
    cpuUsage := 100 * (totalTicks - idleTicks) / totalTicks

    fmt.Printf("CPU usage is %f%% [busy: %f, total: %f]\n", cpuUsage, totalTicks-idleTicks, totalTicks)
}
It seems like I'm allowed to link to the full implementation that I wrote on Bitbucket; if not, feel free to delete this. It only works on Linux so far, though: systemstat.go
The mechanism for getting CPU usage is OS-dependent, since the numbers mean slightly different things to different OS kernels.
On Linux, you can query the kernel to get the latest stats by reading the pseudo-files in the /proc/ filesystem. These are generated on-the-fly when you read them to reflect the current state of the machine.
Specifically, the /proc/<pid>/stat file for each process contains the associated process accounting information. It's documented in proc(5). You're interested specifically in fields utime, stime, cutime and cstime (starting at the 14th field).
You can calculate the percentage easily enough: just read the numbers, wait some time interval, and read them again. Take the difference, divide by the amount of time you waited, and there's your average. This is precisely what the top program does (as well as all other programs that perform the same service). Bear in mind that you can have over 100% cpu usage if you have more than 1 CPU.
If you just want a system-wide summary, that's reported in /proc/stat -- calculate your average using the same technique, but you only have to read one file.
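A rough per-process sketch of that technique (Linux only; field numbering as in proc(5); the one-second interval and the 100 ticks per second are assumptions, the real tick rate comes from sysconf(_SC_CLK_TCK)):
package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
    "time"
)

// processTicks returns utime+stime+cutime+cstime for a pid, in clock ticks.
func processTicks(pid int) (uint64, error) {
    data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
    if err != nil {
        return 0, err
    }
    // The comm field (2nd) may contain spaces, so split after the closing ')'.
    s := string(data)
    rest := s[strings.LastIndexByte(s, ')')+2:]
    fields := strings.Fields(rest)
    // rest starts at field 3 (state), so fields 14..17 are fields[11:15] here.
    var total uint64
    for _, f := range fields[11:15] {
        v, err := strconv.ParseUint(f, 10, 64)
        if err != nil {
            return 0, err
        }
        total += v
    }
    return total, nil
}

func main() {
    pid := os.Getpid() // measuring this process itself, as an example
    t0, _ := processTicks(pid)
    time.Sleep(time.Second)
    t1, _ := processTicks(pid)
    const ticksPerSecond = 100 // typically sysconf(_SC_CLK_TCK)
    fmt.Printf("%.1f%% CPU\n", float64(t1-t0)/ticksPerSecond*100)
}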
You can use the os/exec package to execute the ps command and get the result.
Here is a program issuing the ps aux command, parsing the result, and printing the CPU usage of all processes on Linux:
package main

import (
    "bytes"
    "log"
    "os/exec"
    "strconv"
    "strings"
)

type Process struct {
    pid int
    cpu float64
}

func main() {
    cmd := exec.Command("ps", "aux")
    var out bytes.Buffer
    cmd.Stdout = &out
    err := cmd.Run()
    if err != nil {
        log.Fatal(err)
    }

    processes := make([]*Process, 0)
    for {
        line, err := out.ReadString('\n')
        if err != nil {
            break
        }
        tokens := strings.Split(line, " ")
        ft := make([]string, 0)
        for _, t := range tokens {
            if t != "" && t != "\t" {
                ft = append(ft, t)
            }
        }
        log.Println(len(ft), ft)
        pid, err := strconv.Atoi(ft[1])
        if err != nil {
            continue
        }
        cpu, err := strconv.ParseFloat(ft[2], 64)
        if err != nil {
            log.Fatal(err)
        }
        processes = append(processes, &Process{pid, cpu})
    }

    for _, p := range processes {
        log.Println("Process ", p.pid, " takes ", p.cpu, " % of the CPU")
    }
}
Here is an OS-independent solution using cgo to harness the clock() function provided by the C standard library:
//#include <time.h>
import "C"
import "time"

var startTime = time.Now()
var startTicks = C.clock()

func CpuUsagePercent() float64 {
    clockSeconds := float64(C.clock()-startTicks) / float64(C.CLOCKS_PER_SEC)
    realSeconds := time.Since(startTime).Seconds()
    return clockSeconds / realSeconds * 100
}
I recently had to take CPU usage measurements from a Raspberry Pi (Raspbian OS) and used github.com/c9s/goprocinfo combined with what is proposed here:
Accurate calculation of CPU usage given in percentage in Linux?
The idea comes from the htop source code and is to have two measurements (previous / current) in order to calculate the CPU usage:
func calcSingleCoreUsage(curr, prev linuxproc.CPUStat) float32 {
    PrevIdle := prev.Idle + prev.IOWait
    Idle := curr.Idle + curr.IOWait

    PrevNonIdle := prev.User + prev.Nice + prev.System + prev.IRQ + prev.SoftIRQ + prev.Steal
    NonIdle := curr.User + curr.Nice + curr.System + curr.IRQ + curr.SoftIRQ + curr.Steal

    PrevTotal := PrevIdle + PrevNonIdle
    Total := Idle + NonIdle
    // fmt.Println(PrevIdle, Idle, PrevNonIdle, NonIdle, PrevTotal, Total)

    // differentiate: actual value minus the previous one
    totald := Total - PrevTotal
    idled := Idle - PrevIdle

    CPU_Percentage := (float32(totald) - float32(idled)) / float32(totald)
    return CPU_Percentage
}
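A usage sketch for calcSingleCoreUsage as defined above (the goprocinfo subpackage path github.com/c9s/goprocinfo/linux and the one-second interval are assumptions):
package main

import (
    "fmt"
    "time"

    linuxproc "github.com/c9s/goprocinfo/linux" // assumed import path
)

func main() {
    prev, err := linuxproc.ReadStat("/proc/stat")
    if err != nil {
        panic(err)
    }
    time.Sleep(time.Second)
    curr, err := linuxproc.ReadStat("/proc/stat")
    if err != nil {
        panic(err)
    }
    // CPUStats holds one entry per core (the per-cpu lines of /proc/stat).
    for i := range curr.CPUStats {
        usage := calcSingleCoreUsage(curr.CPUStats[i], prev.CPUStats[i])
        fmt.Printf("core %d: %.1f%%\n", i, usage*100)
    }
}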
For more you can also check https://github.com/tgogos/rpi_cpu_memory
