Background:

Strings is a Linux command which finds ASCII sequences within a file. This just so happens to be a very important part of the analysis of a binary file as a compiler will leave artifacts within the file in plain text. This can tell you the language used to build it, the compiler used, the names of variables and functions, and more. It’s also a very simple starting point of the analyse phase of reverse engineering, and was a obvious first step to do.

Theory:

Like all programming problems, there are countless ways to find all the strings in a given binary file, one way is that you can assume every value within a file between 1 and 127 inclusive is an ascii character, and that groups of characters 3 or more in length are strings.

// buf is a u8 array containing a filebuffer loaded into memory. 
for (buf, 0..) |val, i| {
    _ = i;
    if (val <= 127 and val > 0) {
        try string.append(val);
    } else if (string.items.len >= 3) {
        std.debug.print("{s}\n", .{string.items});
        string.clearAndFree();
    } else {
        string.clearAndFree();
    }
}
if (string.items.len >= 3) {
    std.debug.print("{s}\n", .{string.items});
}

However, the issues with this method are twofold:

  1. It is very inefficient, taking up lots of memory and compute.
  2. It has a lot of false positives, thinking portions of a binary file to be strings when they clearly are not.

To address the first issue, we can find slices of the buffer instead of creating new arraylists:

// buf is a u8 array containing a filebuffer loaded into memory. 
// filesize is an integer representing the size of the filebuffer. 

var start: ?usize = null;
for (buf, 0..) |val, i| {
    if (val <= 127 and val > 0) {
        if (start == null) {
            start = i;
        }
    } else if (start != null and i - start.? >= 3) {
        try stdout.print("{s}\n", .{buf[start.?..i]});
        start = null;
    } else {
        start = null;
    }
}
if (start != null) {
    std.debug.print("{s}\n", .{buf[start.?..file.size]});
} 

In this case, I am taking advantage of Zig’s null union type which allows to set a value to both null and an integer. In this case, it is quite continent to keeping track of where I am starting and ending.

However, this still does not address the compute time needed, nor the false positives, for that, I turned to calculating entropy.

Entropy

Entropy is a measure of the randomness of a given function, with, in general, text having an entropy of about 0.6. This means we should be able to remove many false positives by calculating the entropy of a given string candidate. Entropy can be described as follows:

- 𝛴 pi(x) * log2(pi(x))

where pi is the probability of the occurrence of i, or, for our purposes, the mean of how many times a 1 occurs in a given set of bytes. This can be accomplished as follows:

pub fn entropy(string: []const u8) f16 {
    var sum: u64 = 0;
    // The entropy function is as follows:
    // 𝛴 p(x) * log2(p(x))
    // Where p(x) is the probability of the occurance of x.
    // In other words, we are averages the number of 1's
    // occuring in a 16 byte segment of a file:
    // p(x) =  𝛴 (@popCount(val))/(16*8).
    for (string, 0..) |val, j| {
        _ = j;
        sum += @popCount(val);
    }
    if (sum == 0) {
        return 0;
    }
    var p: f16 = @as(f16, @floatFromInt(sum)) / (@as(f16, @floatFromInt(string.len * 8)));
    return (-1 * (p * @log2(p)));
}

Here I am making heavy use of atomics to improve performance and clean up code. I also am using the smallest available float size of f16 to reduce memory usage and improve performance. What’s nice is, we can now use this to simply plug into our previous algorithm to reduce false positives:

// buf is a u8 array containing a filebuffer loaded into memory. 
// filesize is an integer representing the size of the filebuffer. 

var start: ?usize = null;
for (buf, 0..) |val, i| {
    if (val <= 127 and val > 0) {
        if (start == null) {
            start = i;
        }
    } else if (start != null and i - start.? >= 3) {
        const localEntropy = entropy(file.buf[start.?..i]);
        if (localEntropy <= 0.55 and localEntropy >= 0.1) {
            try stdout.print("{s}\n", .{buf[start.?..i]});
        }
        start = null;
    } else {
        start = null;
    }
}
if (start != null) {
    std.debug.print("{s}\n", .{buf[start.?..file.size]});
}

With all of this put together, I can get a fairly simple, but reasonable effective, strings command which can be used for further analysis of a binary file. You can see the final product here.