Sunday, 21 August 2011

How Kalimat produces EXE files


Everyone tells me Kalimat is a toy language since it doesn't produce .exe files.

At first, I didn't pay attention, since my main goal was teaching children programming. I mean, does Small Basic produce exes? Does Scratch?

But gradually, I changed my mind:
  • Many children would feel patronized if they perceive that they're being taught with a "kiddy" language. Even if the language is actually powerful, if it seems kiddy, that's bad.
  • Being an Arabic-based language means it's under more scrutiny, since a lot of people hold the "Arabs can't make a real language" point of view and will look for any reason to say so.
  • There is a real technical need for making .exe files from programs, so that users - kid or adult - can distribute their programs to others.
So started the journey of making executables. I began considering my options:
  1. Generate assembly code or machine code from Kalimat, perhaps using something like LLVM or C--
  2. Generate code in another language like C++ or Go, and use e.g. a C++ compiler to create the .exe
  3. Cheat
Cheating sounds good, right? What does that mean exactly? Well, in early versions of Visual Basic (long before VB6 or .NET) the IDE could create exe files, but not quite the way you'd expect: the file contained a bytecode version of your program, and you had to include a DLL that came with VB and contained an interpreter for this bytecode. All your exe had to do was load the DLL and tell it: "Here, take this program and run it for me, will you?".

This is also how py2exe works: it bundles your Python program and a Python interpreter into one package, and that is your executable.

Kalimat has already taken a lot of ideas from Basic and Python, so I decided to go this route and add the feature quickly, and in the long term to consider adding the capability of making real, 'respectable' .exe files.

(I do mean 'quickly': it was done in about 3 days.)

Step 1: Separate SmallVM into its own DLL

The Kalimat IDE and SmallVM (the virtual machine that runs Kalimat programs) were very tightly coupled in the source code. I had to spend some time moving all the runtime code from the IDE to the VM, making small changes as I went, and exporting some VM functions.

Now I have an independent smallvm.dll which exports a function that your programs can send code to execute.

Even better: smallvm.dll does not take Kalimat code, but takes code in the form of its own assembly. That means if you're creating your own programming language you can use it.

Step 2: Generate the "driver" program

Now suppose the user typed this program and wants to create an .exe from it:
اطبع 12
First the Kalimat IDE will generate this assembly:

.method main
pushv 12
callex print
ret
.endmethod
This is good. Now we need a program that does something like this:
#include "smallvm.h"

int main()
{
    // SmallVM assembly for the program above, embedded as a string literal
    const char *program = ".method main\npushv 12\ncallex print\nret\n.endmethod";
    SmallVMRunCode(program);
    return 0;
}
If this program is then compiled into an .exe, we're done!
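
A driver like this can be produced mechanically from the assembly listing. Below is a minimal sketch of such a generator in C++. It assumes the simplified SmallVMRunCode entry point shown above; escapeForCLiteral and makeDriverSource are hypothetical helper names made up for this illustration, not functions from the Kalimat source.

```cpp
#include <cassert>
#include <string>

// Escape text so it can be placed inside a C/C++ string literal.
// Only the characters that appear in plain assembly listings are handled.
std::string escapeForCLiteral(const std::string &text)
{
    std::string out;
    for (char c : text) {
        switch (c) {
        case '\\': out += "\\\\"; break;
        case '"':  out += "\\\""; break;
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break;
        case '\t': out += "\\t";  break;
        default:   out += c;      break;
        }
    }
    return out;
}

// Build the source text of a driver program for one assembly listing.
// "smallvm.h" and SmallVMRunCode are the simplified names from the post;
// the function actually exported by the DLL may differ.
std::string makeDriverSource(const std::string &assembly)
{
    assert(!assembly.empty());  // an empty listing is almost certainly a generator bug
    std::string src;
    src += "#include \"smallvm.h\"\n\n";
    src += "int main()\n{\n";
    src += "    const char *program = \"" + escapeForCLiteral(assembly) + "\";\n";
    src += "    SmallVMRunCode(program);\n";
    src += "    return 0;\n}\n";
    return src;
}
```

Escaping in one place means a stray backslash or quote in the listing can't produce an invalid string literal in the generated file.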

Notice that I've simplified a lot of details here. For example the char *program is actually not a direct representation of the program but a base64 encoding of it. Also notice that I could've used a technique called 'binary blobs' to bypass the need for repeatedly compiling C++ code and just use a linker to combine object files.

So far so good, but that means I need to include a C or C++ compiler (or a linker) with Kalimat. On the Linux version of Kalimat that's easy: Just add gcc or g++ as a dependency and the package manager would take care of the job.

On Windows I'd have to manually bundle a compiler. The standard Open Source C++ compiler on Windows is MinGW. It's a little more than 120 megabytes...

Ouch. Remember that Kalimat's download is currently about 5.4 megabytes.

I tried to take only the necessary files from MinGW and include them, but failed. I don't know what little stuff depends on other little stuff. It might be possible, even easy, but I don't want to keep trying things out aimlessly, and I don't want to study the structure of the GNU toolchain right now. Let's find another way.

No problem, I thought: I'll use Google's Go language. The compiler and linker (8g.exe, 8l.exe) are 1.8 megabytes together, and they don't need anything else to work. Excellent! All I need to do is generate a small Go program that calls a function from a C dll.

To do this, I think you use a tool called cgo that's bundled with Go. I tried for some time to use cgo but failed. I didn't spend a long time doing that; maybe I'm too lazy, maybe if I spent a little more time I'd have figured it out, but anyway...

What other languages produce native .exe's these days? I know: Free Pascal.

I'll spoil the surprise for you: this is the current solution. Yup! Good ole' Pascal :)

At first, the generated .pas file looked something like this:
program RunSmallVM;

procedure RunSmallVMCodeBase64(A: PChar; B: PChar);
  stdcall; external 'smallvm.dll';

begin
  RunSmallVMCodeBase64('', '2e6d657468');
end.
This is good as long as your encoded program is small. Once it gets a little large, you find out that a traditional Pascal string can't hold more than 255 characters. What?

Ok, you can add a compiler directive to make the language use another type of string (AnsiStrings), but string literals maintain the 255 character limit. Sigh :(

No problem: I made the code generator emit a series of string concatenations that build up the final program string. This lengthens the time between loading the .exe and running the program, but it now works. I can speed things up later, by embedding binary blobs in the .exe or something.
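
As an illustration of the concatenation trick, here is a minimal sketch of how a generator might split the encoded program into Pascal literals of at most 255 characters each. The function name toPascalConcat and its layout choices are assumptions made for this example, not the actual Kalimat generator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Turn `payload` into a Pascal expression made of string literals joined
// with '+', where no single literal is longer than `maxLen` characters.
// Assumes the payload has no single quotes or line breaks (true for
// base64 text), so no escaping is performed.
std::string toPascalConcat(const std::string &payload, std::size_t maxLen = 255)
{
    assert(maxLen > 0);
    if (payload.empty())
        return "''";
    std::string expr;
    for (std::size_t pos = 0; pos < payload.size(); pos += maxLen) {
        if (pos > 0)
            expr += " +\n  ";  // continue the expression on a new line
        expr += "'" + payload.substr(pos, maxLen) + "'";
    }
    return expr;
}
```

For example, toPascalConcat("abcdef", 4) yields 'abcd' on one line followed by + and 'ef' on the next, which can be pasted where the single long literal used to be.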

I also had trouble with the base64 encoding of programs: parts of the SmallVM assembly program are themselves encoded in base64, and the encoding seems to get messed up in this case. This is what happened to me:

programHeader = encode64(stuff)
program = programHeader + restOfCode
stringToSend = encode64(program)

originalProgram = decode64(stringToSend)

To my surprise, the string originalProgram is not equal to programHeader+restOfCode as expected, but instead it is equal to stuff+restOfCode. It seems the base64 decoder is too eager to decode anything that looks like base64 characters :(
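
For comparison, here is a minimal sketch of what a strict round trip should do: one call to the decoder undoes only the outer layer and leaves the inner base64 text alone. The names encode64 and decode64 come from the pseudocode above, but these bodies are a plain textbook base64 written for illustration, not the decoder Kalimat uses. The helper outerLayerOnly is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

static const std::string kAlphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Standard base64 encoding with '=' padding.
std::string encode64(const std::string &in)
{
    std::string out;
    unsigned buffer = 0;  // holds bits not yet emitted
    int bits = 0;         // number of pending bits in buffer
    for (unsigned char byte : in) {
        buffer = (buffer << 8) | byte;
        bits += 8;
        while (bits >= 6) {
            bits -= 6;
            out += kAlphabet[(buffer >> bits) & 63];
        }
    }
    if (bits > 0)  // flush the last partial group, zero-padded on the right
        out += kAlphabet[(buffer << (6 - bits)) & 63];
    while (out.size() % 4 != 0)
        out += '=';
    return out;
}

// Decodes exactly one layer of base64; stops at the first '=' padding.
std::string decode64(const std::string &in)
{
    std::string out;
    unsigned buffer = 0;
    int bits = 0;
    for (char c : in) {
        if (c == '=')
            break;
        std::size_t value = kAlphabet.find(c);
        assert(value != std::string::npos);  // reject non-base64 input
        buffer = (buffer << 6) | static_cast<unsigned>(value);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out += static_cast<char>((buffer >> bits) & 0xFF);
        }
    }
    return out;
}

// Mirrors the pseudocode above: true when decoding returns
// programHeader + restOfCode with the header still encoded.
bool outerLayerOnly(const std::string &stuff, const std::string &restOfCode)
{
    std::string programHeader = encode64(stuff);
    std::string program = programHeader + restOfCode;
    std::string stringToSend = encode64(program);
    std::string originalProgram = decode64(stringToSend);
    return originalProgram == program;
}
```

If a check like outerLayerOnly fails against the real decoder, the bug is in the decoding step rather than in the way the header and the rest of the code are combined.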

As a hack I used different functions in the .DLL to send different parts of the program. I'll figure out a proper solution later.

There is a lesson to be learned here: it pays to diversify your knowledge! In order to create an actually useful product, I went through a journey of old and new technology: VB, py2exe, MinGW, Go, and even Pascal. You never know which piece of knowledge will finally solve the problem.

So it's buggy, it's hacky, it's unstable, but it's there! Kalimat can now generate .exe files! And with a few iterations I hope it works well enough for day to day usage.

2 comments:

Yasser says...

A great article, but I feel the approach you took is a long one...
Couldn't you, for example, make the Kalimat interpreter an exe file instead of a dll, and then embed the program (in whatever form) inside that file, so that the interpreter reads the code from the very file it lives in?

Mohamed Samy says...

@Yasser

That might have been possible, but I don't feel there is any difference in the number of steps between that approach and the one I used.

And the SmallVM.dll approach has its advantages anyway: for example, a user can download a faster or more stable version of the SmallVM dll and put it in their program's directory, so the program's performance improves just by copying a file, without needing to regenerate the exe from the source code.