Zip files are used both as general archives, for example when preparing to send multiple files, and as the storage container for current word processing packages. The current packages are Office 2007 and later (.DOCX files) and Open Office (.ODT files).
The reason for the Zip container is that a .DOCX file is built from several XML files, which are very verbose and compress easily. A file that would otherwise be perhaps 100K is reduced to nearer 10K. Zipping therefore saves space and also reduces the chance that the file will be fragmented. If a file is fragmented, the CnW data carving function will recover many such files, and the process is described briefly below.
Zip files are fairly straightforward to defragment as they have a well defined data structure, with sections linked by pointers and lengths. As mentioned above, many DOCX files are fairly small and will be fragmented no more than a few times in the worst case. The exception is when they have embedded photos.
The Zip file structure.
The basic file structure is well documented (http://www.phpconcept.net/pclzip/pkzip.txt is one such link), so the following is just a brief outline.
File signature
The basic signature is 'PK' followed by 0x03 0x04, which marks a local file header.
00000000 50 4B 03 04 14 00 00 00 - 08 00 47 72 23 39 CC 1E PK........Gr#9..
00000010 5C F4 57 3B 00 00 68 A0 - 00 00 18 00 00 00 6C 69 \.W;..h.......li
00000020 62 2F 61 75 74 6F 2F 57 - 69 6E 33 32 2F 57 69 6E b/auto/Win32/Win
00000030 33 32 2E 64 6C 6C ED 7D - 0B 78 54 D5 D5 E8 C9 7B 32.dll.}.xT....{
00000040 80 81 09 90 60 84 00 03 - 24 10 CA 6B F2 22 C9 3C ....`...$..k.".<
0x00 local file header signature 4 bytes ( PK 0x03 0x04 )
0x04 version needed to extract 2 bytes
0x06 general purpose bit flag 2 bytes
0x08 compression method 2 bytes
0x0a last mod file time 2 bytes
0x0c last mod file date 2 bytes
0x0e crc-32 4 bytes
0x12 compressed size 4 bytes
0x16 uncompressed size 4 bytes
0x1a filename length 2 bytes
0x1c extra field length 2 bytes
filename (variable size)
extra field (variable size)
The example above shows that the compressed size of the file is 0x3b57 and the uncompressed size is 0xa068. The file name is 0x18 bytes long and the extra field is empty, so the compressed data starts at location 0x1e (length of header) + 0x18 (name length), i.e. offset 0x36. As we know that this section is 0x3b57 bytes long, the next PK header will be at location 0x3b57 + 0x36, i.e. 0x3b8d.
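The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the CnW implementation: the function name `next_header_offset` is made up for this example, and the bytes are the first 0x1e bytes of the dump above.

```python
import struct

# Fixed part of a local file header: 0x1e bytes, little-endian.
# Fields: signature, version, flags, method, time, date,
#         crc-32, compressed size, uncompressed size,
#         filename length, extra field length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")

def next_header_offset(buf, offset=0):
    """Return the offset just past the entry whose local header is at `offset`.

    Hypothetical helper. It trusts the size stored in the header, which is
    zero when general purpose flag bit 3 (data descriptor) is set.
    """
    (sig, _version, _flags, _method, _mtime, _mdate,
     _crc, comp_size, _uncomp_size, name_len, extra_len) = \
        LOCAL_HEADER.unpack_from(buf, offset)
    if sig != b"PK\x03\x04":
        raise ValueError("not a local file header")
    data_start = offset + LOCAL_HEADER.size + name_len + extra_len
    return data_start + comp_size

# First 0x1e bytes of the example dump.
example = bytes.fromhex(
    "504b0304 1400 0000 0800 4772 2339 cc1e5cf4"
    "573b0000 68a00000 1800 0000"
)
print(hex(next_header_offset(example)))  # 0x3b8d
```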
On a fragmented file, the technique is to search for a PK header at the offset within a cluster that corresponds to file offset 0x3b8d, i.e. 0x3b8d modulo the cluster size. Thus for a cluster size of 0x4000 bytes the offset would be 0x3b8d, but for a cluster size of 0x1000 bytes it would be 0xb8d. With a limited number of Zip files on the disk, the chance of a mismatch is small. The header checksum can be verified to make sure it is a valid PK header.
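As a rough sketch of this search, assuming the disk image is held in memory and that clusters are numbered from the start of the image, the following Python lists the clusters that have a local header signature at the expected position. The names `offset_in_cluster` and `candidate_clusters` are invented for the illustration, and the image here is a synthetic one.

```python
LOCAL_SIG = b"PK\x03\x04"

def offset_in_cluster(file_offset, cluster_size):
    """Position inside a cluster at which a given file offset falls."""
    return file_offset % cluster_size

def candidate_clusters(image, cluster_size, expected_file_offset):
    """Yield cluster numbers with a local header signature at the expected
    in-cluster offset. A signature that straddles a cluster boundary is
    not handled in this sketch."""
    within = offset_in_cluster(expected_file_offset, cluster_size)
    for start in range(0, len(image), cluster_size):
        pos = start + within
        if image[pos:pos + 4] == LOCAL_SIG:
            yield start // cluster_size

print(hex(offset_in_cluster(0x3b8d, 0x4000)))  # 0x3b8d
print(hex(offset_in_cluster(0x3b8d, 0x1000)))  # 0xb8d

# Synthetic image: four empty 0x1000-byte clusters, with a signature
# planted in cluster 2 at in-cluster offset 0xb8d.
image = bytearray(4 * 0x1000)
image[2 * 0x1000 + 0xb8d:2 * 0x1000 + 0xb8d + 4] = LOCAL_SIG
print(list(candidate_clusters(bytes(image), 0x1000, 0x3b8d)))  # [2]
```

Each candidate cluster would still need its header fields checked before being accepted as the continuation of the file.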
00003B80 16 5B 24 20 FF 62 A5 B5 - 76 AB F0 3F 00 50 4B 03 .[$ .b..v..?.PK.
00003B90 04 14 00 00 00 08 00 DC - 6A 23 39 E2 F1 1B 49 02 ........j#9...I.
00003BA0 3B 00 00 61 A0 00 00 1C - 00 00 00 6C 69 62 2F 61 ;..a.......lib/a
00003BB0 75 74 6F 2F 57 69 6E 33 - 32 2F 57 69 6E 33 32 2E uto/Win32/Win32.
00003BC0 64 6C 6C 2E 41 41 41 ED - 7D 7B 78 54 D5 B5 F8 C9 dll.AAA.}{xT....
As can be seen, a new PK header is found at the expected location, 0x3b8d. This process can be continued through the file.
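Continuing the process through an unfragmented file might look like the sketch below. It builds a small Zip in memory with Python's standard zipfile module purely as test data, then follows the length fields from one local header to the next. The function name `walk_local_headers` and the member names are assumptions for the example.

```python
import io
import struct
import zipfile

LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # 0x1e bytes

def walk_local_headers(buf):
    """Return (offset, name) for each local header reached by following
    the compressed size, filename length and extra field length."""
    entries = []
    offset = 0
    while buf[offset:offset + 4] == b"PK\x03\x04":
        fields = LOCAL_HEADER.unpack_from(buf, offset)
        comp_size, name_len, extra_len = fields[7], fields[9], fields[10]
        name_start = offset + LOCAL_HEADER.size
        name = buf[name_start:name_start + name_len].decode("cp437")
        entries.append((offset, name))
        # Next header follows the filename, extra field and compressed data.
        offset = name_start + name_len + extra_len + comp_size
    return entries

# Test data only: a small archive written to a seekable buffer, so the
# sizes are stored in the local headers rather than in data descriptors.
mem = io.BytesIO()
with zipfile.ZipFile(mem, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    zf.writestr("content.xml", "<p>hello</p>" * 200)
    zf.writestr("styles.xml", "<style/>" * 200)
data = mem.getvalue()

for offset, name in walk_local_headers(data):
    print(hex(offset), name)
```

The walk stops at the first position that does not hold a local file header signature, which in an intact file is the start of the central register.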
Central Register
Towards the end of the file a central register (the central directory in the PKZIP documentation) is stored. This is a directory of all files within the Zip file.
000033E0 2C 5B 02 17 72 AC F2 FF - 01 00 00 FF FF 03 00 50 ,[..r..........P
000033F0 4B 01 02 2D 00 0A 00 00 - 00 00 00 00 00 21 00 5E K..-.........!.^
00003400 C6 32 0C 27 00 00 00 27 - 00 00 00 08 00 00 00 00 .2.'...'........
00003410 00 00 00 00 00 00 00 00 - 00 00 00 00 00 6D 69 6D .............mim
00003420 65 74 79 70 65 50 4B 01 - 02 2D 00 14 00 06 00 08 etypePK..-......
0x00 central file header signature 4 bytes PK 0x01 0x02
0x04 version made by 2 bytes
0x06 version needed to extract 2 bytes
0x08 general purpose bit flag 2 bytes
0x0a compression method 2 bytes
0x0c last mod file time 2 bytes
0x0e last mod file date 2 bytes
0x10 crc-32 4 bytes
0x14 compressed size 4 bytes
0x18 uncompressed size 4 bytes
0x1c filename length 2 bytes
0x1e extra field length 2 bytes
0x20 file comment length 2 bytes
0x22 disk number start 2 bytes
0x24 internal file attributes 2 bytes
0x26 external file attributes 4 bytes
0x2a relative offset of local header 4 bytes
0x2e filename (variable size)
extra field (variable size)
file comment (variable size)
The central register can be used to verify the file structure and that all elements are present and correct. If there is an error, then it is likely that somewhere there has been a false positive match.
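One possible form of this verification, sketched in Python under the assumption that the rebuilt file is in memory, is to walk the central register and confirm that each entry points at a local header with the same name and compressed size. The function name `check_central_register` is hypothetical, and the archive is generated with the standard zipfile module only to have something to check.

```python
import io
import struct
import zipfile

LOCAL_HEADER = struct.Struct("<4s5H3I2H")      # 0x1e bytes
CENTRAL_HEADER = struct.Struct("<4s6H3I5H2I")  # 0x2e bytes
EOCD = struct.Struct("<4s4H2IH")               # 22 bytes, final header

def check_central_register(buf, cd_offset, entry_count):
    """Return a list of problems found comparing central and local headers."""
    problems = []
    pos = cd_offset
    for _ in range(entry_count):
        if (buf[pos:pos + 4] != b"PK\x01\x02"
                or pos + CENTRAL_HEADER.size > len(buf)):
            problems.append(f"no central header at {pos:#x}")
            break
        c = CENTRAL_HEADER.unpack_from(buf, pos)
        comp_size, name_len, extra_len, comment_len = c[8], c[10], c[11], c[12]
        local_off = c[16]
        name_at = pos + CENTRAL_HEADER.size
        name = bytes(buf[name_at:name_at + name_len])
        if buf[local_off:local_off + 4] != b"PK\x03\x04":
            problems.append(f"{name!r}: no local header at {local_off:#x}")
        else:
            h = LOCAL_HEADER.unpack_from(buf, local_off)
            local_name_at = local_off + LOCAL_HEADER.size
            if bytes(buf[local_name_at:local_name_at + h[9]]) != name:
                problems.append(f"{name!r}: local filename differs")
            elif not (h[2] & 0x08) and h[7] != comp_size:
                # Sizes only comparable when no data descriptor is used.
                problems.append(f"{name!r}: compressed size differs")
        pos = name_at + name_len + extra_len + comment_len
    return problems

# Test data only.
mem = io.BytesIO()
with zipfile.ZipFile(mem, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    zf.writestr("content.xml", "<p>hello</p>" * 200)
data = mem.getvalue()

# With no archive comment, the final header is the last 22 bytes.
eocd = EOCD.unpack_from(data, len(data) - EOCD.size)
count, cd_offset = eocd[4], eocd[6]
print(check_central_register(data, cd_offset, count))  # expect an empty list

# Simulate a false positive match by destroying the first local header.
damaged = bytearray(data)
damaged[0:4] = b"\x00\x00\x00\x00"
print(check_central_register(damaged, cd_offset, count))
```

A fuller check would also decompress each member and compare its CRC-32 with the stored value. That step is left out here to keep the sketch short.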
Final header
The final header is basically a pointer to the start of the central register.
0x00 end of central dir signature 4 bytes ( PK 0x05 0x06 )
0x04 number of this disk 2 bytes
0x06 number of the disk with the start of the central directory 2 bytes
0x08 total number of entries in the central dir on this disk 2 bytes
0x0a total number of entries in the central dir 2 bytes
0x0c size of the central directory 4 bytes
0x10 offset of start of central directory with respect to the starting disk number 4 bytes
0x14 .ZIP file comment length 2 bytes
0x16 .ZIP file comment (variable size)
00003540 74 79 6C 65 73 2E 78 6D - 6C 50 4B 05 06 00 00 00 tyles.xmlPK.....
00003550 00 06 00 06 00 5A 01 00 - 00 EF 33 00 00 00 00 .....Z....3....
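The final header in the dump above can be decoded with a short Python sketch. The function name `parse_final_header` is made up for the example, and the input is the 22 bytes that start at offset 0x3549 in the dump. The decoded values agree with the earlier central register dump, whose first entry starts at 0x33ef.

```python
import struct

# Final header (end of central directory), 22 bytes plus optional comment.
EOCD = struct.Struct("<4sHHHHIIH")

def parse_final_header(raw):
    """Decode the fixed part of the final header into a dict."""
    (sig, disk, cd_disk, entries_here, entries_total,
     cd_size, cd_offset, comment_len) = EOCD.unpack_from(raw)
    if sig != b"PK\x05\x06":
        raise ValueError("not a final header")
    return {
        "entries": entries_total,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "comment_len": comment_len,
    }

# The 22 bytes starting at file offset 0x3549 in the dump.
eocd_offset = 0x3549
raw = bytes.fromhex(
    "504b0506 0000 0000 0600 0600 5a010000 ef330000 0000"
)
info = parse_final_header(raw)
print(info["entries"], hex(info["cd_size"]), hex(info["cd_offset"]))
# 6 0x15a 0x33ef

# In a single-disk file the central register ends where the final
# header begins, which gives a quick consistency check.
print(info["cd_offset"] + info["cd_size"] == eocd_offset)  # True
```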
CnW Zip Recovery
The CnW routine can be called after the data carving has detected corrupted - possibly fragmented - Zip files. It will run the above techniques to scan the hard drive / memory chip for fragments that fit the zip file.