Decompiling the GPL-violated Linux kernel using Evolutionary Algorithms
TL;DR: We want to decompile binary code into byte-equivalent C code. We look at this from an optimization viewpoint: we have a generated C code (or AST), and we want to optimize it so that when we compile it, it is equivalent to the binary code, byte by byte. And I think it's better to use a population-based optimization metaheuristic to do this, such as Genetic Programming.
Requirements to understand this post
The idea I'm writing about is very deep. You need to know about the current problem we have with companies violating the GPL by not releasing the Linux kernel source code used in their devices. Then you need to know what an algorithm is, what a heuristic is, and what the difference between them is. You also need to know what an optimization problem is and what our goal in these problems is. After that, you need to know that in this very specific optimization problem, we are not looking for "good enough" solutions, unlike most other optimization problems. We are rather looking for the "perfect" solution, if we can find one.
Backstory
Note: If you know about the GPL violation that happens by not releasing the kernel source code, you can skip this part.
Linux, the kernel, has spread like a virus. Various hardware companies around the globe need software that talks to their hardware and then provides APIs for higher-level programs, so their hardware can be of use. This includes mobile phone and tablet manufacturers. Android has become the de facto standard operating system for the vast majority of mobiles and tablets, and Android works with the Linux system calls. The only usable kernel which provides these system calls is Linux itself. So manufacturers take Linux, do a small amount of work to add support for their devices, and their devices become usable.
Linux is free and open source software with a very mature ecosystem. And by free, I mean both in the sense that it's libre and that it's zero cost (for them). It's a great choice for companies, so they won't need to write a kernel from scratch. There are very few programmers who can write a kernel, and even fewer who also write standard, high-quality, readable code. That's why they choose Linux. They get a free dinner; they just need to bring a spoon and fork to eat it.
But Linux is under a copyleft licence called the GPL. Not just that: it's also strong copyleft. In theory, this means that if I have a device in which a modified version of the Linux kernel is used, I have the right to its source code. Note that this right is mine. If you haven't got such a device, you don't have the right to the source code.
However, as almost always, they teach you one thing in theory, and in practice you need to forget about all of it.
Since having the kernel source code is me right anyway, and they are not going to give it to me, I'm considering taking it meself. If we could somehow decompile the binary kernel code into C code with the same semantics, then we would be in a much better position and closer to mainlining me device.
What is the goal?
We have a kernel binary extracted from me device, and we need to derive the readable and idiomatic C code which, when compiled, produces the same binary. The general term for this process is decompilation. However, this very specific decompilation also has a very specific trait: it is byte-equivalent decompilation. That is, the C code we recover produces the same binary code as the binary code of the kernel we've got.
Theoretically, we could also go for finding semantically equivalent C code. However, last time I researched, checking semantic equivalence turned out to be a very hard problem: in general it is undecidable, and even restricted versions are intractable. So we create less trouble for ourselves by checking for equivalence of either the assembly code or the byte code. We could also go for the IR, if possible.
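To make the objective concrete, here is a minimal sketch (my own, not from any paper) of a byte-equivalence fitness function: compile a candidate C source with a known compiler and flags, then score it by how many bytes it reproduces. The compiler, the flags, and the whole-object comparison are simplifying assumptions; a real setup would compare just the code bytes, e.g. the .text section extracted with objcopy.

```python
import pathlib
import subprocess
import tempfile

def byte_fitness(candidate_src: str, target_bytes: bytes,
                 cc: str = "gcc", flags: tuple = ("-O2", "-c")) -> float:
    """Fraction of target bytes reproduced by compiling candidate_src."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "cand.c"
        obj = pathlib.Path(tmp) / "cand.o"
        src.write_text(candidate_src)
        proc = subprocess.run([cc, *flags, str(src), "-o", str(obj)],
                              capture_output=True)
        if proc.returncode != 0:  # doesn't even compile: worst score
            return 0.0
        # Simplification: this compares the whole object file, headers
        # included; a real setup would extract just the code, e.g. with
        # `objcopy -O binary --only-section=.text cand.o text.bin`.
        cand = obj.read_bytes()
    matched = sum(a == b for a, b in zip(cand, target_bytes))
    # Score against the longer of the two so length mismatches are penalized.
    return matched / max(len(cand), len(target_bytes), 1)
```

A perfect score of 1.0 is exactly the "byte-equivalent" goal; anything less is just a gradient for the search to climb.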
How do we achieve it?
The short answer
I really don’t know. And I bet no one else does!
The long answer
Time needed
This is a research idea, and an idea for the long term. Research itself can be seen as an optimization problem! This means that you don't know if the entire region you're searching has got something as perfect as you want it to be. Or, if it has got one, let's say a maximum, you don't know where it is or how long it will take for you to reach it. It might be right next to you, but you might not know in which direction. Or it might be far, far away.
Educated guess and previous related work
But we can make some plans and educated guesses. I'm not educated about decompiling, nor about the kernel. But hey! It's just another optimization problem! There is a research paper from a few years ago which, as far as I know, first coins the idea of a byte-equivalent decompiler. From what I've realized, they just wanted to show "feasibility": their method could decompile tiny (~20 LoC) C programs. They used Evolutionary Algorithms to reach this goal, and they hadn't converted the binary to an IR.
Their method uses a database of relevant C code, then uses an EA to search this database and construct a C code which, when compiled, matches the binary code. There are many limitations. One is that the compiler, its version, and the flags used must be known. The paper also assumed that the binary code can be constructed from their database of C code. In our case, this might not be entirely true.
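To make the search loop concrete, here is a generic genetic-programming skeleton in the spirit of that approach; it is my own sketch, not the paper's algorithm. An individual is a fixed-length list of snippets drawn from a database `db` (a name I made up), and fitness is the `byte_fitness` sketch from earlier.

```python
import random

def evolve(db, target_bytes, pop_size=100, generations=1000,
           genome_len=20, mut_rate=0.2):
    # An individual is a fixed-length list of snippets from the database.
    pop = [[random.choice(db) for _ in range(genome_len)]
           for _ in range(pop_size)]
    cache = {}
    def score(ind):
        key = "\n".join(ind)
        if key not in cache:               # compiling is slow, so memoize
            cache[key] = byte_fitness(key, target_bytes)
        return cache[key]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        if score(pop[0]) == 1.0:           # perfect byte-equivalent match
            return pop[0]
        parents = pop[:pop_size // 2]      # truncation selection
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, genome_len)
            child = a[:cut] + b[cut:]      # one-point crossover
            if random.random() < mut_rate: # mutation: swap in a DB snippet
                child[random.randrange(genome_len)] = random.choice(db)
            children.append(child)
        pop = children
    return max(pop, key=score)
```

Note the paper's assumption baked into this sketch: the target must actually be constructible from `db`, and the compiler invoked inside `byte_fitness` must be the right one.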
Generating the initial population
Typically in EAs, the initial population is generated randomly. But here, we shouldn't do that. We should instead use some (meta)heuristic that outputs some sloppy, half-correct C code, from which we generate our initial population.
The topic of neural decompilers has been hot among researchers; recently, some employ Large Language Models to decompile. In whichever case, an important consideration is that if we want to use a neural model, it should have been trained on kernel code. General models most likely won't fit our purpose. And, considering the No Free Lunch theorem, we shouldn't focus too much on neural decompilers and forget everything else.
There are also other heuristics and algorithms to decompile a binary. As reported by the BED paper, seeding with decompiled C code from another, "traditional" decompiler helped a lot. A sketch of such seeding follows.
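Here is a minimal sketch of what that seeding could look like: take one "sloppy" decompilation from whatever front end is available and derive the population by lightly perturbing it. `seed_population`, `base_lines`, `db`, and the noise rate are all hypothetical names and parameters for illustration.

```python
import random

def seed_population(base_lines, db, pop_size=100, noise=0.1):
    """Derive the initial population from one heuristic decompilation.

    `base_lines` is the output of any front end (a traditional decompiler,
    a neural model, ...) split into lines; `db` is the snippet database.
    """
    pop = [list(base_lines)]              # keep the unmodified seed itself
    while len(pop) < pop_size:
        ind = list(base_lines)
        for i in range(len(ind)):
            if random.random() < noise:   # perturb a fraction of the lines
                ind[i] = random.choice(db)
        pop.append(ind)
    return pop
```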
Representing our C code
We probably have to use an AST to represent the C code; it's much more natural than working with C source text directly. We could try to make our search space smaller by allowing only valid ASTs: for instance, when a crossover or mutation is done, its output must be a valid C AST. There are other ways to limit the search space. For instance, we could entirely remove some C features. I'm not advanced with C, but I think we could remove any loop other than `while`. We could also entirely let go of `struct`s. We could then use some heuristic to detect where a bunch of pointers is actually a struct, or where a `while` loop should've been a `for` loop. A sketch of such a normalization pass follows.
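As a concrete illustration of removing a C feature, here is a minimal sketch of a `for`-to-`while` normalization pass, written with pycparser (a real Python C parser); the reverse direction, recovering `for` loops and structs for readability, would need much more machinery. The pass only handles loops that appear directly inside a block, which is enough for a sketch.

```python
from pycparser import c_ast, c_generator, c_parser

class ForToWhile(c_ast.NodeVisitor):
    """Rewrite `for (init; cond; next) body` into
    `init; while (cond) { body; next; }`."""
    def visit_Compound(self, node):
        items = []
        for item in node.block_items or []:
            if isinstance(item, c_ast.For):
                body = (list(item.stmt.block_items or [])
                        if isinstance(item.stmt, c_ast.Compound)
                        else [item.stmt] if item.stmt else [])
                if item.next is not None:
                    body.append(item.next)       # loop increment at the end
                if isinstance(item.init, c_ast.DeclList):
                    items.extend(item.init.decls)  # C99 `for (int i = 0; ...)`
                elif item.init is not None:
                    items.append(item.init)
                items.append(c_ast.While(item.cond, c_ast.Compound(body)))
            else:
                items.append(item)
        node.block_items = items
        self.generic_visit(node)                 # recurse into nested blocks

src = "int sum(int n) { int s = 0; for (int i = 0; i < n; i++) s += i; return s; }"
ast = c_parser.CParser().parse(src)
ForToWhile().visit(ast)
print(c_generator.CGenerator().visit(ast))       # now uses only `while`
```

Every genome then only ever contains `while` loops, which shrinks the space the EA has to search.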
Representing the binary code
I have a feeling that the fewer layers there are between our evolutionary decompiler and the binary code, the better chance of success we'll have. This means we should use either the disassembled code or the IR, if possible. If we could learn the exact toolchain version, that would go a great way forward.
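For instance, we could compare at the disassembly level instead of the raw-byte level. A minimal sketch using Capstone (a real disassembler library) follows; the AArch64 architecture is an assumption based on the Snapdragon SoC mentioned later in this post.

```python
import capstone

def disasm_fitness(cand_bytes: bytes, target_bytes: bytes) -> float:
    """Score a candidate on disassembled instructions, not raw bytes."""
    md = capstone.Cs(capstone.CS_ARCH_ARM64, capstone.CS_MODE_ARM)
    cand = [(i.mnemonic, i.op_str) for i in md.disasm(cand_bytes, 0x0)]
    targ = [(i.mnemonic, i.op_str) for i in md.disasm(target_bytes, 0x0)]
    # Naive positional comparison; a real setup would normalize addresses
    # in branch operands, since those differ even for equivalent code.
    matched = sum(a == b for a, b in zip(cand, targ))
    return matched / max(len(cand), len(targ), 1)
```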
Why Evolutionary?
I have a feeling that a population-based metaheuristic is much better than a heuristic which travels one point at a time. However, it doesn't necessarily have to be an EA. But one must consider that EAs have been in business as population-based metaheuristics for a long time; therefore, many different techniques are available and already researched.
Making the C code readable
After we have the decompiled C code, it is most likely unreadable and very unidiomatic. We could have another round to find appropriate names for the variables, mutate the code to become more readable, find names for the structs and functions, and so on. We could use a language model for this, together with an already existing code database. The BED paper itself has cited some papers on this.
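As a toy illustration of the renaming part: the name map could come from a language model or a code database, as suggested above. `rename_identifiers` and the example mapping are made up for illustration.

```python
import re

def rename_identifiers(src: str, mapping: dict) -> str:
    """Apply a name map like {'v3': 'page_count'} to C source text.

    A naive textual pass; a real pass should rewrite the AST so it
    cannot touch strings, comments, or struct fields by accident.
    """
    for old, new in mapping.items():
        src = re.sub(rf"\b{re.escape(old)}\b", new, src)
    return src

print(rename_identifiers("int v3 = 0; v3 += n;", {"v3": "page_count"}))
```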
Also later, Genetic Improvement might be useful.
Other misc notes
I am thinking about mainlining me e-book reader. I have an Onyx Boox device, and the company actively refuses to release the kernel source code. But mainlining a recent enough Boox device might be easier than it seems:
- E-book reader devices use e-paper displays, and a vast majority of e-paper displays come from E Ink. There are other devices which use the same panel as me device, like Kobo or PocketBook, and Kobo releases the kernel source code for its devices.
- Me Boox device uses a Snapdragon SoC. Snapdragon is quite well supported in mainline.
- An e-book reader device doesn't need the modem or the GPU to be working. Of course, mainlining the GPU would be very good if we could have 2D acceleration. But it is really not necessary, at least for me device, which has got quite a good CPU.
Am I going to do it?
I really would like to. But first I should finish me wakegp research and eat its fruits. Then I can think about another research project. In case someone else were up to it, I would be glad and appreciate it.