Fragmented Zip and DOCX files

Zip files are used both as general archives, for example when preparing to send multiple files, and as the storage container for current word processing packages.  These packages are Microsoft Office 2007 and later (.DOCX files) and Open Office (.ODT files).


The reason for the zip framework is that a .DOCX file is based on several XML files that are very verbose and easily compressed.  A file that would otherwise be maybe 100K is reduced to nearer 10K.  Zipping the files therefore saves space, and also reduces the chance that the file will be fragmented.  If a file is fragmented, the CnW data carving function will recover many such files, and the process is described briefly below.


Zip files are fairly straightforward to defragment as they have a well defined data structure, helped by sections with pointers and lengths.  As mentioned above, many DOCX files are fairly small, and in the worst case will be fragmented only a few times.  The exception is when they have embedded photos.


The Zip file structure


The basic file structure is well documented (http://www.phpconcept.net/pclzip/pkzip.txt is one such link), so the following is just a brief outline.


File signature

The basic signature is 'PK' followed by 0x03 0x04, which marks a local file header.


00000000   50 4B 03 04 14 00 00 00 - 08 00 47 72 23 39 CC 1E    PK........Gr#9..
00000010   5C F4 57 3B 00 00 68 A0 - 00 00 18 00 00 00 6C 69    \.W;..h.......li
00000020   62 2F 61 75 74 6F 2F 57 - 69 6E 33 32 2F 57 69 6E    b/auto/Win32/Win
00000030   33 32 2E 64 6C 6C ED 7D - 0B 78 54 D5 D5 E8 C9 7B    32.dll.}.xT....{
00000040   80 81 09 90 60 84 00 03 - 24 10 CA 6B F2 22 C9 3C    ....`...$..k.".<



       0x00 local file header signature     4 bytes  ( PK 0x03 0x04 )
       0x04 version needed to extract       2 bytes
       0x06 general purpose bit flag        2 bytes
       0x08 compression method              2 bytes
       0x0a last mod file time              2 bytes
       0x0c last mod file date              2 bytes
       0x0e crc-32                          4 bytes
       0x12 compressed size                 4 bytes
       0x16 uncompressed size               4 bytes
       0x1a filename length                 2 bytes
       0x1c extra field length              2 bytes

       0x1e filename (variable size)
            extra field (variable size)


The example above shows that the compressed size of the file is 0x3b57 and the uncompressed size is 0xa068.  The file name is 0x18 bytes long and the extra field is empty, so the compressed data starts at location 0x1e (length of the fixed header) + 0x18 (name length), i.e. offset 0x36.  As this section is 0x3b57 bytes long, the next PK header will be at location 0x3b57 + 0x36, i.e. 0x3b8d.
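
The arithmetic above can be checked with a short script.  The following is a minimal Python sketch, not the CnW implementation: it unpacks the fixed 0x1e byte header from the example dump and derives where the next header should start.  The names parse_local_header and EXAMPLE are illustrative only.

```python
import struct

# Fixed part of a local file header: signature, five 2-byte fields,
# three 4-byte fields and two 2-byte lengths (0x1e bytes, little-endian).
LOCAL_HEADER = struct.Struct("<4s5H3I2H")

# Bytes 0x00-0x35 of the example dump: fixed header plus file name.
EXAMPLE = bytes.fromhex(
    "504B0304 1400 0000 0800 4772 2339 CC1E5CF4"
    "573B0000 68A00000 1800 0000"
) + b"lib/auto/Win32/Win32.dll"


def parse_local_header(buf, pos=0):
    """Decode the local file header at pos and predict the next header."""
    (sig, _version, flags, method, _mtime, _mdate, crc,
     csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(buf, pos)
    if sig != b"PK\x03\x04":
        raise ValueError("no local file header at offset %#x" % pos)
    name_start = pos + LOCAL_HEADER.size
    data_start = name_start + name_len + extra_len
    return {
        "name": bytes(buf[name_start:name_start + name_len]),
        "flags": flags,
        "method": method,
        "crc": crc,
        "compressed_size": csize,
        "uncompressed_size": usize,
        "data_start": data_start,
        "next_header": data_start + csize,
    }


info = parse_local_header(EXAMPLE)
print(hex(info["data_start"]), hex(info["next_header"]))  # 0x36 0x3b8d
```

One caveat: when bit 3 of the general purpose flag is set, the sizes in the local header are stored as zero and the real values follow the data, so this prediction only works for headers like the one above where the flag is clear.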


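On a fragmented file, the technique is to search for a PK header at the expected offset within a cluster.  The expected file offset here is 0x3b8d, so for a cluster size of 0x4000 bytes the header would be at offset 0x3b8d within its cluster, but for a cluster size of 0x1000 bytes it would be at offset 0xb8d.  With a limited number of Zip files on the disk, the chance of a mismatch is limited.  The candidate can then be checked to make sure it is a valid PK header, for example by testing that its fields are plausible; the local header carries no checksum of its own, only the CRC-32 of the file data.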
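
As an illustration of the cluster search, the sketch below reduces the expected file offset to an offset within a cluster and tests only that position in every cluster of a raw image.  It is a simplified Python example under stated assumptions: the image is held in memory, clusters start at offset 0 of the image, and the function names are hypothetical rather than part of the CnW software.

```python
SIGNATURE = b"PK\x03\x04"


def in_cluster_offset(file_offset, cluster_size):
    """Offset of a file position relative to the start of its cluster."""
    return file_offset % cluster_size


def find_candidate_clusters(image, file_offset, cluster_size):
    """Return the numbers of all clusters that hold a local file header
    signature at the position where the next header is expected."""
    within = in_cluster_offset(file_offset, cluster_size)
    hits = []
    for cluster_start in range(0, len(image), cluster_size):
        pos = cluster_start + within
        if image[pos:pos + 4] == SIGNATURE:
            hits.append(cluster_start // cluster_size)
    return hits


print(hex(in_cluster_offset(0x3b8d, 0x4000)))  # 0x3b8d
print(hex(in_cluster_offset(0x3b8d, 0x1000)))  # 0xb8d

# Synthetic 8-cluster image with one header planted in cluster 5.
demo = bytearray(8 * 0x1000)
demo[5 * 0x1000 + 0xb8d:5 * 0x1000 + 0xb8d + 4] = SIGNATURE
print(find_candidate_clusters(demo, 0x3b8d, 0x1000))  # [5]
```

If the expected position falls in the last three bytes of a cluster, the signature itself is split across two clusters that may not be adjacent on disk; this sketch does not handle that case.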


00003B80   16 5B 24 20 FF 62 A5 B5 - 76 AB F0 3F 00 50 4B 03    .[$ .b..v..?.PK.
00003B90   04 14 00 00 00 08 00 DC - 6A 23 39 E2 F1 1B 49 02    ........j#9...I.
00003BA0   3B 00 00 61 A0 00 00 1C - 00 00 00 6C 69 62 2F 61    ;..a.......lib/a
00003BB0   75 74 6F 2F 57 69 6E 33 - 32 2F 57 69 6E 33 32 2E    uto/Win32/Win32.
00003BC0   64 6C 6C 2E 41 41 41 ED - 7D 7B 78 54 D5 B5 F8 C9    dll.AAA.}{xT....


As can be seen, a new PK header is found in the correct location.  This process can be continued through the file.
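
Continuing the process through a file amounts to following the chain of local headers until something other than a local header is found.  The Python sketch below does this on a small, unfragmented archive built in memory with the standard zipfile module; the member names and contents are made up for the demonstration, and walk_local_headers is a hypothetical helper, not the CnW routine.

```python
import io
import struct
import zipfile


def walk_local_headers(buf):
    """Follow local file headers from offset 0.
    Returns ([(offset, name), ...], offset where the chain stopped)."""
    pos = 0
    found = []
    while buf[pos:pos + 4] == b"PK\x03\x04":
        flags, = struct.unpack_from("<H", buf, pos + 0x06)
        if flags & 0x08:
            break  # sizes are stored after the data, cannot predict
        csize, _usize, name_len, extra_len = struct.unpack_from(
            "<IIHH", buf, pos + 0x12)
        name = buf[pos + 0x1e:pos + 0x1e + name_len].decode("cp437")
        found.append((pos, name))
        # fixed header + name + extra field + compressed data
        pos += 0x1e + name_len + extra_len + csize
    return found, pos


# Build a small test archive in memory (contents are arbitrary).
stream = io.BytesIO()
with zipfile.ZipFile(stream, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    zf.writestr("content.xml", "<p>hello</p>" * 100)
    zf.writestr("styles.xml", "<s/>" * 100)
data = stream.getvalue()

headers, end = walk_local_headers(data)
for offset, name in headers:
    print(hex(offset), name)
print("chain ends at", hex(end), data[end:end + 4])
```

On an intact archive the chain ends at the first central header (PK 0x01 0x02).  On a fragmented file the point where the expected header is missing marks the cluster after which a fragment has to be located.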



Central Register


Towards the end of the file a central register is stored (the Zip specification calls it the central directory).  This is a directory of all files within the Zip file.


000033E0   2C 5B 02 17 72 AC F2 FF - 01 00 00 FF FF 03 00 50    ,[..r..........P
000033F0   4B 01 02 2D 00 0A 00 00 - 00 00 00 00 00 21 00 5E    K..-.........!.^
00003400   C6 32 0C 27 00 00 00 27 - 00 00 00 08 00 00 00 00    .2.'...'........
00003410   00 00 00 00 00 00 00 00 - 00 00 00 00 00 6D 69 6D    .............mim
00003420   65 74 79 70 65 50 4B 01 - 02 2D 00 14 00 06 00 08    etypePK..-......



        0x00 central file header signature   4 bytes   PK 0x01 0x02
        0x04 version made by                 2 bytes
        0x06 version needed to extract       2 bytes
        0x08 general purpose bit flag        2 bytes
        0x0a compression method              2 bytes
        0x0c last mod file time              2 bytes
        0x0e last mod file date              2 bytes
        0x10 crc-32                          4 bytes
        0x14 compressed size                 4 bytes
        0x18 uncompressed size               4 bytes
        0x1c filename length                 2 bytes
        0x1e extra field length              2 bytes
        0x20 file comment length             2 bytes
        0x22 disk number start               2 bytes
        0x24 internal file attributes        2 bytes
        0x26 external file attributes        4 bytes
        0x2a relative offset of local header 4 bytes

        0x2e filename (variable size)
             extra field (variable size)
             file comment (variable size)


The central register can be used to verify the file structure and that all elements are present and correct.  If there is an error, then it is likely that somewhere there has been a false positive match.
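
A sketch of such a check in Python is shown below.  It decodes the first central entry from the dump above (which starts at file offset 0x33ef) and tests whether a local header with the same name sits at the offset the entry points to.  The function names are hypothetical, and a fuller check would also compare the sizes and CRC-32.

```python
import struct

# Fixed part of a central file header (0x2e bytes, little-endian).
CENTRAL_HEADER = struct.Struct("<4s6H3I5H2I")

# First entry of the dump above, file offsets 0x33ef-0x3424.
EXAMPLE_ENTRY = bytes.fromhex(
    "504B0102 2D00 0A00 0000 0000 0000 2100 5EC6320C"
    "27000000 27000000 0800 0000 0000 0000 0000"
    "00000000 00000000"
) + b"mimetype"


def parse_central_entry(buf, pos=0):
    """Decode one central file header starting at pos."""
    f = CENTRAL_HEADER.unpack_from(buf, pos)
    if f[0] != b"PK\x01\x02":
        raise ValueError("no central file header at offset %#x" % pos)
    name_len, extra_len, comment_len = f[10], f[11], f[12]
    return {
        "method": f[4],
        "crc": f[7],
        "compressed_size": f[8],
        "uncompressed_size": f[9],
        "name": bytes(buf[pos + 0x2e:pos + 0x2e + name_len]),
        "local_offset": f[16],
        # position of the following central entry
        "next": pos + 0x2e + name_len + extra_len + comment_len,
    }


def local_header_agrees(image, entry):
    """True if a local header with the same name is where the entry says."""
    off = entry["local_offset"]
    if len(image) < off + 0x1e or image[off:off + 4] != b"PK\x03\x04":
        return False
    name_len, = struct.unpack_from("<H", image, off + 0x1a)
    return image[off + 0x1e:off + 0x1e + name_len] == entry["name"]


entry = parse_central_entry(EXAMPLE_ENTRY)
print(entry["name"], hex(entry["local_offset"]), hex(0x33ef + entry["next"]))
# b'mimetype' 0x0 0x3425
```

The last value, 0x3425, is where the dump shows the second PK 0x01 0x02 signature, so the entry lengths are consistent with the data.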



Final header


The final header is basically a pointer to the start of the central register.



        0x00 end of central dir signature                 4 bytes  (PK 0x05 0x06)
        0x04 number of this disk                          2 bytes
        0x06 number of the disk with the start of the
             central directory                            2 bytes
        0x08 total number of entries in the central dir
             on this disk                                 2 bytes
        0x0a total number of entries in the central dir   2 bytes
        0x0c size of the central directory                4 bytes
        0x10 offset of start of central directory with
             respect to the starting disk number          4 bytes
        0x14 .ZIP file comment length                     2 bytes

        0x16 .ZIP file comment (variable size)




00003540   74 79 6C 65 73 2E 78 6D - 6C 50 4B 05 06 00 00 00    tyles.xmlPK.....
00003550   00 06 00 06 00 5A 01 00 - 00 EF 33 00 00 00 00       .....Z....3....
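
Decoding the record above gives 6 entries, a central directory size of 0x15a and a start offset of 0x33ef, which is exactly where the first PK 0x01 0x02 header appears in the earlier dump.  The Python sketch below reproduces that decoding and adds one consistency test: in a simple single-disk archive the directory ends where this record begins, so offset + size should equal the record's own position (0x33ef + 0x15a = 0x3549).  The helper names are illustrative, and the backwards search assumes the signature does not also occur inside a file comment.

```python
import struct

# End of central directory record: 22 fixed bytes, little-endian.
EOCD = struct.Struct("<4s4H2IH")

# The record from the dump above, found at file offset 0x3549.
EXAMPLE_EOCD = bytes.fromhex(
    "504B0506 0000 0000 0600 0600 5A010000 EF330000 0000")


def find_eocd(buf):
    """Position of the last end-of-central-directory signature, or -1."""
    return buf.rfind(b"PK\x05\x06")


def parse_eocd(buf, pos):
    """Decode the end of central directory record at pos."""
    (sig, _disk, _cd_disk, _entries_here, entries,
     cd_size, cd_offset, comment_len) = EOCD.unpack_from(buf, pos)
    if sig != b"PK\x05\x06":
        raise ValueError("no end of central directory at %#x" % pos)
    return {
        "entries": entries,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "comment_len": comment_len,
        # True when the directory ends exactly where this record starts
        "consistent": cd_offset + cd_size == pos,
    }


# Stand-in image: zero padding up to 0x3549, then the example record.
image = bytes(0x3549) + EXAMPLE_EOCD
pos = find_eocd(image)
record = parse_eocd(image, pos)
print(hex(pos), record["entries"], hex(record["cd_offset"]),
      hex(record["cd_size"]), record["consistent"])
# 0x3549 6 0x33ef 0x15a True
```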



CnW Zip Recovery


The CnW routine can be called after data carving has detected corrupted (possibly fragmented) Zip files.  It applies the above techniques to scan the hard drive or memory chip for fragments that fit the Zip file.