Let’s get the “basics” down. <\/strong><\/h2>\n\n\n\n<\/p>\n\n\n\n
So what really even is reverse engineering? Some areas, such as static analysis, dynamic analysis, cryptanalysis and decompilation, revolve around inspecting the operations themselves rather than the data they work on. The data in this case can be something like the plaintext message passed to a cryptographic function, or the data used by an application at runtime, such as the assets of a game. Reversing that data can be classified separately for this reason: for game assets it is often called binary file format RE, or just “data reverse engineering<\/strong>”. <\/p>\n\n\n\nIn a way, there are 2 distinct routes of reversing: software reverse engineering and data reverse engineering. This distinction permeates the field. Static analysis<\/em> is the act of inspecting the software without running it, so none of its runtime data<\/strong> is around. In contrast, dynamic analysis<\/em> inspects the software while it executes, so the data can affect the application, which allows for a more detailed inspection of the software as it behaves when running normally. <\/p>\n\n\n\n<\/p>\n\n\n\n
In the realm of video games, data represents the large majority of a game’s footprint. In the early years, many novel techniques were employed to get around the hardware limitations of the consoles of the time, and this inspired a lot of different file formats. As games require a lot of assets (models, textures, sound files, etc.), there’s almost always a need to keep things organised so that the game can read each asset as and when it needs to. So, just like zip files are a handy way for you to archive your own personal data, game archive file formats were a big thing in the early years. A developer may even use the same file format across all of their games. <\/p>\n\n\n\n
This first post will go over a simple archive file format I encountered while doing some RE work as part of a release for Vectorman (Unreleased) (for PS2). The post for that release will go over some of the more interesting technical aspects of the game, but I thought this would be a good example of how to approach something like an archive file format. <\/p>\n\n\n\n
Finding the target<\/h2>\n\n\n\n The game is a 3D platformer with a simple frontend, multiple levels, weapons, AI to fight and some objectives to complete. Armed with this knowledge, I knew that there would be at least a collection of 3D models, textures, audio and possibly game logic scripts, so I set out to find where these are stored so that they can be inspected. The root directory contained only the executable, a file named “SMALLF.DAT”, a disc-configuration file necessary for the platform, and 2 subdirectories which only contained some music files and a few textures.<\/p>\n\n\n\n <\/figure>\n\n\n\nWhile we can see that some assets are present, it doesn’t seem like everything is there, which is usually a sign that an archive format is being used somewhere. In this case, the obvious candidate is “SMALLF.DAT”, and when we open it up in a simple hex editor, we can see something interesting:<\/p>\n\n\n\n <\/figure>\n\n\n\nIt looks like this is our target: we have a collection of filenames, and then the rest of the file just seems to be “random data”. A quick look over the other builds we have access to shows that this file is present in every build of the game.<\/p>\n\n\n\n
<\/p>\n\n\n\n
The Method<\/h2>\n\n\n\n So first off, we need to check over what exactly we have in that first part of the file. It seems to contain a collection of paths, but there are also some bytes which aren’t zero, and the first 5(?!) bytes seem to be unique.<\/p>\n\n\n\n <\/figure>\n\n\n\nStarting at those first 5 bytes then, we can see what they represent in decimal… 0x43 is decimal 67. Hmm, well let’s count how many paths we have… <\/p>\n\n\n\n <\/figure>\n\n\n\nLooks like there are 66 total entries, which is only one short of 67, so maybe this byte represents how many entries there are! <\/p>\n\n\n\n
Moving over to offset 0x1, we see bytes 05 00 00<\/code>. Can’t really make much sense of these at this point, so we’ll just move on to offset 0x4 instead. <\/p>\n\n\n\nHere at 0x4, we have a single byte with a hex value of 13<\/code>. What follows then is the path, and we can see that it ends with a byte of 00<\/code> in every instance.. and this pattern becomes evident throughout this entire first part of the file…<\/p>\n\n\n\n <\/figure>\n\n\n\nThis is an extremely basic overview of how the pattern can start to emerge. <\/p>\n\n\n\n
In the C programming language you usually have some idea of how much data you need at a given time, or need in order to perform a certain operation. That holds true for strings too, as a string in C is just a collection of bytes followed by a null-byte (00<\/code>) to denote the end of the string. The pattern which is emerging above seems to suggest that each entry has a single byte before the path, which gives possible values between 0-255. Many file systems limit a file name to 255 characters, and Windows has a similar hard limit of 260 characters for a full path, called MAX_PATH<\/code>. So.. maybe 13<\/code> represents the number of characters in the path? <\/p>\n\n\n\nHex 13<\/code> is decimal 19 and the path associated with it is “piLaunch\\combat.bte”.. which is exactly 19 characters long! The pattern is getting a lot clearer now, and we can map this out a bit better.<\/p>\n\n\n\nIt seems as though the first 4 bytes represent an int32_t<\/code>. This stores data necessary for reading the actual “table of contents” and we refer to it as the header. We can ignore that for now and concentrate on what is stored for each entry in the table.<\/p>\n\n\n\n <\/figure>\n\n\n\nWhat’s left after the path (and the null-byte!) is 4 unique bytes. I call them unique bytes because each entry in the table seems to have a different value here. But let’s take what we have right now and try to make some sense out of it. <\/p>\n\n\n\n
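struct Header {\n\tunsigned char num_entries;       \/\/ first byte (0x43): our guess at the entry count\n\tunsigned char unknown_values[3]; \/\/ 05 00 00: not understood yet\n};\nstruct Entry {\n\tunsigned char path_size; \/\/ length of the path, excluding the null-byte\n\tstd::string path;        \/\/ null-terminated in the file\n\tint unknown;             \/\/ the 4 unique bytes after the path\n};\nstruct File {\n\tHeader hdr;\n\tstd::vector<Entry> entries;\n};<\/code><\/pre>\n\n\n\nIn essence, this is the pattern that we can see, represented as C++-style structs (a sketch of the layout rather than something to read straight from disk, since the path is variable-length). <\/p>\n\n\n\n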
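To make the guessed entry layout a bit more concrete, here is a minimal sketch of parsing a single record out of a byte buffer. The names `TocEntry` and `parse_toc_entry` are made up for this illustration and are not taken from the actual tool, and reading the 4 trailing bytes as a little-endian value is an assumption at this stage.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One table-of-contents record as guessed so far:
// [1 byte path length][path bytes][00][4 bytes, assumed little-endian]
struct TocEntry {
    std::string path;
    std::uint32_t value = 0;  // meaning still unknown at this point
};

// Hypothetical helper: parse the record starting at `pos` and advance `pos` past it.
// Returns false (leaving `pos` untouched) if the bytes do not fit the guessed layout.
bool parse_toc_entry(const std::vector<std::uint8_t>& buf, std::size_t& pos, TocEntry& out) {
    if (pos >= buf.size()) return false;
    const std::size_t len = buf[pos];
    const std::size_t end = pos + 1 + len + 1 + 4;  // length byte + path + null-byte + value
    if (end > buf.size()) return false;
    if (buf[pos + 1 + len] != 0x00) return false;   // terminator must sit where the length says

    out.path.assign(buf.begin() + pos + 1, buf.begin() + pos + 1 + len);
    const std::size_t v = pos + 1 + len + 1;
    out.value = std::uint32_t(buf[v]) | std::uint32_t(buf[v + 1]) << 8 |
                std::uint32_t(buf[v + 2]) << 16 | std::uint32_t(buf[v + 3]) << 24;
    pos = end;
    return true;
}
```

The check on the null-byte is the useful part when testing a hypothesis like this one: if the length byte really is a path length, every record should pass it, and a single failure tells us the guess is wrong somewhere.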
This is ultimately the heart of data reverse engineering: we take what we know about the data, take any relevant context into consideration, and apply reasoning to the data that we come across. <\/p>\n\n\n\n
The remaining pieces of data which are interesting are those 4 bytes which come after each path. Within data reversing, you tend to want to follow the data which correlates to standard C data types first, before moving on to the more interesting stuff (for example those “unknown” bytes in Header<\/code>).<\/p>\n\n\n\n<\/p>\n\n\n\n
Taking a look at the values then, we can notice a bit of a pattern with them: they only ever seem to increase<\/em><\/strong>.<\/p>\n\n\n\n <\/figure>\n\n\n\nThis is interesting because, as stated earlier, in the C programming language you usually either know <\/em>or have to <\/em>know how much data you are working with. These values stick out because the very first value, 4535, seems small, and the table of contents ends at roughly 0x530 (or decimal 1328). <\/p>\n\n\n\nAt this point we could surmise that it’s somehow the size of the data itself, or perhaps the offset within the file where this entry’s data is located? <\/p>\n\n\n\n
Heading to the location 4535 seems to suggest that it’s not the start location (or “offset”) of the data itself, and it doesn’t seem to be the size of the data, either… but if we were to start reading from the end of the header up to location 4535, that seems to make more sense!<\/p>\n\n\n\n
It’s hard to tell for sure, since the first entry is itself (another) custom binary file for some sort of game asset. However, ironically, we can quite literally perform this investigation in the opposite (or reverse<\/em>) direction and check the last entry in the file, which is conveniently a regular text file! <\/p>\n\n\n\nIn this case the entry is named “materials.dat” and sure enough, the last bytes of the data seem to be some sort of plain-text configuration file for the materials used in the game! And checking the 4-byte value.. it perfectly aligns!<\/p>\n\n\n\n <\/figure>\n\n\n\nThe 4-byte value for this entry is 197514, and that position is exactly after the bytes 0D 0A<\/code>, which are the carriage return and line feed bytes ending the last line of the file. All that is left is a bunch of zeroes which seem to pad out the rest of the file up to a certain size. This was a common practice for games of this era, whose file formats were often designed with specific specifications in mind so that they could be loaded into memory blocks of varying sizes. <\/p>\n\n\n\nFinishing up<\/h2>\n\n\n\n At this point, before continuing, we need to double-check our work and sanity check everything to make sure we are able to fully interpret the file. One thing that is still sticking out is the “header” and those bytes which we don’t really understand yet. However, we already have more than enough information not only to extract this data, but also to write these files ourselves (for the time being we can simply try to reuse any known values and see if those work).<\/p>\n\n\n\n
Extraction of the data seemed great.. except there was one fundamental error<\/strong>: the first byte is not a single byte representing the number of entries, as other samples of “SMALLF.DAT” revealed a different value each time. Upon further inspection, the actual header is just one 4-byte little-endian integer. In this case the bytes 43 05 00 00<\/code> read as 0x543, or decimal 1347.. which is the start location<\/strong> of the file data<\/strong> itself (also, the end location <\/strong>of the header<\/strong>… or the size <\/strong>of the header<\/strong>!). <\/p>\n\n\n\nAnd with this, we are able to fully interpret and thus extract each entry within the file:<\/p>\n\n\n\n
\/\/ assumes: using namespace std; namespace fs = std::filesystem;\nclass SmallFile {\npublic:\n\tstruct Entry {\n\t\tstring path;\n\t\tint offs = 0;\t\/\/ end offset of the data for this entry\n\n\t\tvector<unsigned char> data;\n\t};\n\n\tvector<Entry> Entries;\n\n\tSmallFile() = default;\n\tSmallFile(fs::path in) {\n\t\tread(in);\n\t}\n\n\tvoid read(fs::path in)\n\t{\n\t\tifstream ifs(in, ios::binary);\n\n\t\t\/\/ header: a single 4-byte little-endian value\n\t\tint header_size = -1;\n\t\tifs.read(reinterpret_cast<char*>(&header_size), 4);\n\n\t\t\/\/ table of contents: path_len, path, null-byte, offset (repeated)\n\t\twhile (ifs && ifs.tellg() < header_size-2)\n\t\t{\n\t\t\tunsigned char path_len = 0;\n\t\t\tifs.read(reinterpret_cast<char*>(&path_len), 1);\n\n\t\t\tstring path;\n\t\t\tpath.resize(path_len);\n\t\t\tifs.read(&path[0], path_len);\n\t\t\tifs.get();\t\/\/ skip the null-byte to keep it out of the path\n\n\t\t\tint offs = -1;\n\t\t\tifs.read(reinterpret_cast<char*>(&offs), 4);\n\n\t\t\tEntry entry;\n\t\t\tentry.path = path;\n\t\t\tentry.offs = offs;\n\t\t\tEntries.push_back(entry);\n\t\t}\n\n\t\t\/\/ file data: each entry runs from the current position up to its offset\n\t\tfor (auto& entry : Entries)\n\t\t{\n\t\t\twhile (ifs && ifs.tellg() < entry.offs)\n\t\t\t\tentry.data.push_back(ifs.get());\n\t\t}\n\n\t\tifs.close();\n\t}\t\n};<\/code><\/pre>\n\n\n\n<\/p>\n\n\n\n
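Since each entry only stores where its data ends, the start and size of every entry have to be derived from the entry before it. The sketch below shows that bookkeeping on its own, without any file I/O. `Span` and `spans_from_end_offsets` are hypothetical names for this illustration, and where the data region begins is left as a parameter: the post describes the header value (1347) as the start of the data, while the reader above begins two bytes earlier, so treat that choice as an open assumption.

```cpp
#include <cstdint>
#include <vector>

// Start position and length of one entry's data inside the archive.
struct Span {
    std::uint32_t start;
    std::uint32_t size;
};

// Hypothetical helper: turn a list of end offsets into (start, size) pairs.
// `data_start` is wherever the caller believes the first entry begins.
// Returns an empty vector if the offsets ever go backwards, since the
// values in the table are expected to only ever increase.
std::vector<Span> spans_from_end_offsets(std::uint32_t data_start,
                                         const std::vector<std::uint32_t>& ends) {
    std::vector<Span> out;
    std::uint32_t cursor = data_start;
    for (std::uint32_t end : ends) {
        if (end < cursor) return std::vector<Span>();
        Span s = {cursor, end - cursor};
        out.push_back(s);
        cursor = end;  // the next entry starts where this one ended
    }
    return out;
}
```

Taking the header value at face value, the first entry would span 1347 to 4535, which is 3188 bytes. Having the sizes up front also means each entry could be read with a single `read` call instead of byte by byte.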
This gets us extraction… but what about creation?<\/p>\n\n\n\n
<\/p>\n\n\n\n
In order to create an archive in this format, we just need to do the complete opposite (again, the reverse<\/strong><\/em> operation!) of what we’ve done to extract the files: we work backwards and produce this data ourselves.<\/p>\n\n\n\nSo, working backwards, we first read every file that we want to put into the archive: <\/p>\n\n\n\n
\tsize_t create(fs::path outputDir)\n\t{\n\t\tfor (const auto& p : fs::recursive_directory_iterator(outputDir)) {\n\t\t\tif (!fs::is_regular_file(p))\n\t\t\t\tcontinue;\n\n\t\t\tSmallFile::Entry entry;\n\t\t\tentry.path = fs::relative(p.path(), outputDir).string();\n\t\t\tprintf(\"Adding '%s'... \", entry.path.c_str());\n\n\t\t\tifstream ifs(p.path(), ios::binary);\n\t\t\tif (ifs.good())\n\t\t\t{\n\t\t\t\tifs.seekg(0, ios::end);\n\t\t\t\tsize_t eof = ifs.tellg();\n\t\t\t\tifs.seekg(0, ios::beg);\n\n\t\t\t\twhile ((size_t)ifs.tellg() < eof)\n\t\t\t\t\tentry.data.push_back(ifs.get());\n\n\t\t\t\tprintf(\"(size: 0x%X).\\n\", (int)entry.data.size());\n\t\t\t}\n\t\t\tEntries.push_back(entry);\n\t\t}\n\t\treturn Entries.size();\n\t}<\/code><\/pre>\n\n\n\nSo now Entries<\/code> already stores all of the data we want to put in (the second part of the file format). The next step is to create the “table of contents” (the first part), and for that we need to precalculate its size, so that we can put that size at the beginning of the file (“the header”): <\/p>\n\n\n\n\tint calc_toc_size()\n\t{\n\t\tint toc_size = 4; \/\/ space for the toc size\n\t\tfor (const auto& entry : Entries)\n\t\t{\n\t\t\ttoc_size++;\t\t\t\t\t\t\t\t\/\/ space for path_len\n\t\t\ttoc_size += (int)entry.path.size();\t\/\/ string data\n\t\t\ttoc_size++;\t\t\t\t\t\t\t\t\/\/ null-byte\n\t\t\ttoc_size += 4;\t\t\t\t\t\t\t\/\/ offset\n\n\t\t}\n\t\treturn toc_size + 2;\t\/\/ the reader stops parsing the table at this value minus 2\n\t}<\/code><\/pre>\n\n\n\nHere, we simply loop over each entry in the file and account for each byte of data which it takes up. <\/p>\n\n\n\n
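As a quick sanity check of that arithmetic, here is a small standalone sketch that counts the bytes one record takes up. `toc_record_size` and `toc_bytes` are hypothetical names for this example, and the total deliberately leaves out the extra 2 bytes that `calc_toc_size` adds, since the post doesn't explain what those are for.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Bytes used by one table-of-contents record:
// 1 (path length) + path characters + 1 (null-byte) + 4 (offset).
std::size_t toc_record_size(const std::string& path) {
    return 1 + path.size() + 1 + 4;
}

// Bytes used by the 4-byte header plus every record.
// Assumption: this excludes the 2 extra bytes the original code accounts for.
std::size_t toc_bytes(const std::vector<std::string>& paths) {
    std::size_t total = 4;
    for (const std::string& p : paths)
        total += toc_record_size(p);
    return total;
}
```

For the 19-character path “piLaunch\combat.bte” a record comes to 1 + 19 + 1 + 4 = 25 bytes.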
With this value, we can then simply write out each of the entries in the table, followed by all of the file data!<\/p>\n\n\n\n
\t\t\t\/\/ write header\n\t\t\tint toc_size = calc_toc_size();\n\t\t\tofs.write(reinterpret_cast<char*>(&toc_size), 4);\n\n\t\t\t\/\/ write out toc\n\t\t\tint last_written_block = toc_size-2;\n\t\t\tfor (const auto& entry : Entries)\n\t\t\t{\n\t\t\t\tlast_written_block += (int)entry.data.size();\n\n\t\t\t\tchar path_len = (char)entry.path.size();\n\t\t\t\tofs.write(reinterpret_cast<char*>(&path_len), 1);\n\t\t\t\tofs.write(entry.path.c_str(), path_len);\n\t\t\t\tofs.put(0);\n\t\t\t\tofs.write(reinterpret_cast<char*>(&last_written_block), 4);\n\t\t\t}\n\n\t\t\t\/\/ write out all of the entries\n\t\t\tfor (const auto& entry : Entries)\n\t\t\t\tofs.write((char*)entry.data.data(), entry.data.size());\n\n\t\t\t\/\/ pad it out to 2K boundaries\n\t\t\tint aligned_end = (((int)ofs.tellp()) + 2048 - 1) & ~(2048 - 1);\n\t\t\twhile (ofs.tellp() < aligned_end)\n\t\t\t\tofs.put(0);\n<\/code><\/pre>\n\n\n\n<\/p>\n\n\n\n
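The padding step at the end relies on a common bit trick for rounding up to a power-of-two boundary. Pulled out into its own function it looks like the sketch below. `align_up` is just an illustrative name, and the trick only works when the alignment is a power of two (2048 is).

```cpp
#include <cstdint>

// Round `value` up to the next multiple of `alignment`.
// Only valid when `alignment` is a power of two: adding (alignment - 1)
// pushes the value past the next boundary unless it is already on one,
// and the mask then clears the low bits to land exactly on that boundary.
constexpr std::uint32_t align_up(std::uint32_t value, std::uint32_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}
```

For example, if the archive we looked at earlier followed the same 2K rule, data ending at 197514 would be padded with zeroes up to 198656 (97 × 2048), while a file that already ends on a boundary gets no padding at all.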
Hopefully this explains some of the mentality and processes that you tend to go through when looking at this sort of thing and is useful to any beginners, or anyone just interested in it.<\/p>\n\n\n\n
The full code is available here:<\/p>\n\n\n\n
https:\/\/github.com\/LemonHaze420\/fextract<\/a><\/p>\n
Reverse Engineering Adventures #1: A Foundation - Team Wulinshu<\/title>