Non-Transactional by default
The ZIP file format dictates by its nature that the creation of a zip file is a transactional operation. We accumulate a list of files to compress, with all their metadata, and we create the zip file in a single step, compressing each file sequentially and complying with a storage format that leaves no room for in-place updates. Imagine having to change the contents of a single file within a zip file. You have to rebuild the zip file from the beginning: copy the untouched files' compressed data to a new copy of the zip file, append the modified file's compressed data, and complete the zip file with the new central directory and ending header.
By contrast, changing the contents of a file stored on your hard disk is simple. Each file can be accessed randomly, and changing one file's contents does not require moving or updating the others. Take this for example:
byte[] mydata = System.Text.Encoding.Default.GetBytes( "This is important!" );
AbstractFile file = new DiskFile( @"d:\mydata.txt" );
if( !file.Exists )
  file.Create();
using( Stream stream = file.OpenWrite( true ) )
{
  stream.Write( mydata, 0, mydata.Length );
}
The operation is atomic on the file. The Xceed FileSystem's goal is to mimic this random file access to any possible representation of a file. Thus, exposing compressed files stored in a zip file is no simple task. With the above code, if you replace new DiskFile(...) with new ZippedFile(...), it will work as expected. What you don't see is that only when the stream gets closed will the zip file get rebuilt. All data that you write to the stream is compressed and stored in a temp file, until the last "modify" operation is completed on that zip file. Another example:
byte[] mydata = System.Text.Encoding.Default.GetBytes( "This is important!" );
AbstractFile file1 = new DiskFile( @"d:\mydata.txt" );
AbstractFile file2 = new DiskFile( @"d:\mydatatoo.txt" );
if( !file1.Exists )
  file1.Create();
if( !file2.Exists )
  file2.Create();
using( Stream stream1 = file1.OpenWrite( true ) )
{
  using( Stream stream2 = file2.OpenWrite( true ) )
  {
    stream1.Write( mydata, 0, mydata.Length );
    stream2.Write( mydata, 0, mydata.Length );
  }
}
In the atomic world of disk files, neither file has any influence on the other. But again, replace the DiskFile instances with ZippedFile instances, and it's another story. The two files are stored in a zip file, which can only get rebuilt when the last "modify" operation completes, that is, when stream1 is closed. Will the above code work? Sure! But the zip file will be rebuilt three times. Try it!
byte[] mydata = System.Text.Encoding.Default.GetBytes( "This is important!" );
AbstractFile zipFile = new DiskFile( @"d:\mydatafiles.zip" );
AbstractFile file1 = new ZippedFile( zipFile, @"\mydata.txt" );
AbstractFile file2 = new ZippedFile( zipFile, @"\mydatatoo.txt" );
if( !file1.Exists )
  file1.Create();
Console.WriteLine( "Check the zip file with WinZip!" );
Console.WriteLine( "It should contain one empty file named 'mydata.txt'." );
Console.ReadLine();
if( !file2.Exists )
  file2.Create();
Console.WriteLine( "Check the zip file with WinZip!" );
Console.WriteLine( "It should contain two empty files now." );
Console.ReadLine();
using( Stream stream1 = file1.OpenWrite( true ) )
{
  using( Stream stream2 = file2.OpenWrite( true ) )
  {
    stream1.Write( mydata, 0, mydata.Length );
    stream2.Write( mydata, 0, mydata.Length );
  }
  Console.WriteLine( "Check the zip file with WinZip!" );
  Console.WriteLine( "It still contains two empty files." );
  Console.ReadLine();
}
Console.WriteLine( "Check the zip file with WinZip!" );
Console.WriteLine( "Now it contains both files with their data." );
Console.ReadLine();
The first call to file1.Create increments the "modify" count to 1, then down to 0, so the zip file is built, containing an empty file. After the second call to Create, the zip file is again rebuilt, containing two empty files. When the first call to OpenWrite is made, the "modify" count gets up to 1. After the second call to OpenWrite, it's up to 2. Then stream2 is closed, and the count gets down to 1. Finally stream1 is closed, the count gets to 0, and the zip file is rebuilt, containing two files with compressed data.
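The counting described here can be modeled in a few lines. The sketch below is a hypothetical stand-in: the ModifyCounter class and its Begin and End methods are made up for illustration and are not part of Xceed's API. It only assumes that a rebuild happens whenever the count returns to 0.

```csharp
using System;

// Hypothetical model of the "modify" count; not Xceed's actual implementation.
public class ModifyCounter
{
    private int pending;                       // "modify" operations still in progress
    public int Rebuilds { get; private set; }  // how many times the archive was rebuilt

    public void Begin()
    {
        pending++;
    }

    public void End()
    {
        if (pending == 0)
            throw new InvalidOperationException("End called without a matching Begin.");

        pending--;
        if (pending == 0)
            Rebuilds++;                        // last pending operation done: rebuild
    }
}

public static class ModifyCounterDemo
{
    // Replays the sequence of operations from the example above.
    public static int Run()
    {
        var zip = new ModifyCounter();

        zip.Begin(); zip.End();   // file1.Create(): count 1, then 0, rebuild
        zip.Begin(); zip.End();   // file2.Create(): count 1, then 0, rebuild

        zip.Begin();              // file1.OpenWrite: count 1
        zip.Begin();              // file2.OpenWrite: count 2
        zip.End();                // stream2 closed:  count 1, no rebuild
        zip.End();                // stream1 closed:  count 0, rebuild

        return zip.Rebuilds;
    }
}
```

Calling ModifyCounterDemo.Run() returns 3, matching the three rebuilds in the walkthrough.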
In this simple example, the cost is small. Let's imagine worse:
byte[] mydata = System.Text.Encoding.Default.GetBytes( "This is important!" );
AbstractFile zipFile = new DiskFile( @"d:\mydatafiles.zip" );
if( zipFile.Exists )
  zipFile.Delete();
for( int i=0; i<1000; i++ )
{
  Console.WriteLine( "Loop {0}", i );
  AbstractFile file = new ZippedFile( zipFile, @"\data" + i.ToString() + ".txt" );
  if( !file.Exists )
    file.Create();
  using( Stream stream = file.OpenWrite( true ) )
  {
    stream.Write( mydata, 0, mydata.Length );
  }
}
If you try this, you'll notice that each iteration takes longer than the previous one. Actually, when I tried this, I wasn't patient enough to wait until completion. The zip file would get rebuilt 2000 times (once for each Create and once for each closed stream), with more and more files already in the zip file. This is plainly unacceptable.
Transactional on demand
That's where the IBatchUpdateable interface comes to the rescue. It contains two simple methods: BeginUpdate and EndUpdate. Any class derived from AbstractFile or AbstractFolder can implement this interface, though you can limit this to the root folder. Once BeginUpdate is called, the implementor can hold any modifications to the underlying media until EndUpdate is called. ZipArchive, which represents the root ZippedFolder for a zip file, implements this interface. In short, BeginUpdate artificially increments the "modify" count by 1, and EndUpdate decrements it. If it gets to 0, the underlying zip file is rebuilt. You can call BeginUpdate and EndUpdate as many times as you want, but every call to BeginUpdate must be matched with a call to EndUpdate. The above code could now look like this:
byte[] mydata = System.Text.Encoding.Default.GetBytes( "This is important!" );
AbstractFile zipFile = new DiskFile( @"d:\mydatafiles.zip" );
ZipArchive zip = new ZipArchive( zipFile );
if( zipFile.Exists )
  zipFile.Delete();
zip.BeginUpdate();
try
{
  for( int i=0; i<1000; i++ )
  {
    Console.WriteLine( "Loop {0}", i );
    AbstractFile file = new ZippedFile( zipFile, @"\data" + i.ToString() + ".txt" );
    if( !file.Exists )
      file.Create();
    using( Stream stream = file.OpenWrite( true ) )
    {
      stream.Write( mydata, 0, mydata.Length );
    }
  }
}
finally
{
  zip.EndUpdate();
}
Now that's better. On my machine, this takes a few seconds.
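To get a feel for the difference, here is a rough cost model rather than a measurement. ArchiveModel and its members are hypothetical names invented for this sketch, not Xceed classes. It assumes that every rebuild rewrites all entries currently in the archive, and that each Create and each closed write stream is one "modify" operation.

```csharp
using System;

// Hypothetical cost model; counts rebuilds and entries rewritten, nothing more.
public class ArchiveModel
{
    private int pending;                            // the "modify" count
    private int entries;                            // entries currently in the archive
    public int Rebuilds { get; private set; }
    public long EntriesWritten { get; private set; }

    public void BeginUpdate()
    {
        pending++;
    }

    public void EndUpdate()
    {
        if (pending == 0)
            throw new InvalidOperationException("EndUpdate without matching BeginUpdate.");

        pending--;
        if (pending == 0)
        {
            Rebuilds++;
            EntriesWritten += entries;              // assumed: a rebuild rewrites every entry
        }
    }

    // Models file.Create(): one modify operation that adds an entry.
    public void CreateEntry()
    {
        BeginUpdate();
        entries++;
        EndUpdate();
    }

    // Models opening a write stream, writing, and closing it.
    public void WriteEntry()
    {
        BeginUpdate();
        EndUpdate();
    }

    // Runs the loop from the article, with or without an enclosing batch.
    public static ArchiveModel Simulate(int files, bool batched)
    {
        var zip = new ArchiveModel();
        if (batched) zip.BeginUpdate();
        for (int i = 0; i < files; i++)
        {
            zip.CreateEntry();
            zip.WriteEntry();
        }
        if (batched) zip.EndUpdate();
        return zip;
    }
}
```

Under these assumptions, Simulate(1000, false) reports 2000 rebuilds and 1,001,000 entries rewritten, while Simulate(1000, true) reports a single rebuild and 1000 entries. The unbatched cost grows quadratically with the number of files, which is why each iteration gets slower.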
The FileSystem's main goal was to offer a single, consistent interface for manipulating any kind of file or folder. That's why we decided that ZippedFile and ZippedFolder were to be non-transactional by default, even though in most cases this will end up producing less efficient code. It's the user's job to call BeginUpdate before modifying the zip file, and EndUpdate once completed, to achieve better performance.
By the way, for those who like the using( IDisposable ) pattern in C#, you can use the AutoBatchUpdate class like this:
byte[] mydata = System.Text.Encoding.Default.GetBytes( "This is important!" );
AbstractFile zipFile = new DiskFile( @"d:\mydatafiles.zip" );
ZipArchive zip = new ZipArchive( zipFile );
if( zipFile.Exists )
  zipFile.Delete();
using( AutoBatchUpdate auto = new AutoBatchUpdate( zip ) )
{
  for( int i=0; i<1000; i++ )
  {
    Console.WriteLine( "Loop {0}", i );
    AbstractFile file = new ZippedFile( zipFile, @"\data" + i.ToString() + ".txt" );
    if( !file.Exists )
      file.Create();
    using( Stream stream = file.OpenWrite( true ) )
    {
      stream.Write( mydata, 0, mydata.Length );
    }
  }
}
AutoBatchUpdate implements IDisposable: it calls BeginUpdate on the object at construction, and EndUpdate when disposed. What's even better is that you can pass any FileSystemItem: it will do nothing if the item's RootFolder does not implement IBatchUpdateable. Thus, you can use AutoBatchUpdate without having to know whether the AbstractFile or AbstractFolder you're working with implements IBatchUpdateable or not.
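For readers curious how such a wrapper can be built, here is a minimal sketch of the pattern. It is not Xceed's source: IBatch, BatchScope and CountingBatch are hypothetical names, and the sketch checks the item itself instead of looking up its RootFolder as the real class is described to do.

```csharp
using System;

// Hypothetical stand-in for an interface with BeginUpdate/EndUpdate.
public interface IBatch
{
    void BeginUpdate();
    void EndUpdate();
}

// Begins a batch at construction and ends it on Dispose.
// Does nothing at all when the item does not support batching.
public sealed class BatchScope : IDisposable
{
    private IBatch target;   // null when the item is not batch-updateable

    public BatchScope(object item)
    {
        target = item as IBatch;
        if (target != null)
            target.BeginUpdate();
    }

    public void Dispose()
    {
        if (target != null)
        {
            target.EndUpdate();
            target = null;   // makes a second Dispose call harmless
        }
    }
}

// Tiny test double that records nesting depth and completed batches.
public sealed class CountingBatch : IBatch
{
    public int Depth;
    public int Commits;

    public void BeginUpdate()
    {
        Depth++;
    }

    public void EndUpdate()
    {
        Depth--;
        if (Depth == 0)
            Commits++;       // outermost batch ended
    }
}
```

Wrapping a CountingBatch in using( new BatchScope( batch ) ) raises Depth to 1 inside the block and records one commit on exit, even if an exception is thrown. Passing an object that does not implement IBatch simply does nothing.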
Temp storage
Now, it's good to know that when using BeginUpdate and EndUpdate, the zip file is rebuilt only at the very last moment, but where does the compressed data I'm writing to the streams go? It must be stored somewhere, right? The ZipArchive class exposes two important properties: DefaultTempFolder (static) and TempFolder. By default, the first is equal to new DiskFolder( System.IO.Path.GetTempPath() ), the temp folder of the currently logged-in user. You can assign any AbstractFolder to it, as long as AbstractFile instances created in that folder yield seekable streams (ZippedFile.OpenWrite does not return a seekable stream).
When you create the first instance of a ZipArchive for a given zip file, its TempFolder property is initialized to the value of DefaultTempFolder. Thus, if you assign a folder to the static DefaultTempFolder property, it will apply to all new instances of ZipArchive. If you assign a folder to the TempFolder property, it will only affect ZippedFile, ZippedFolder and ZipArchive instances dealing with that zip file.
If you run the above code while watching your temporary folder using Explorer (hit F5 a few times), you'll see filenames like "XFS330fe108-13b8-4ebb-2299-cace5fa0100a.tmp" appear and disappear. Those files hold the compressed data until the zip file gets rebuilt. Most serious zip libraries allow you to use memory instead of a disk folder while zipping. For example, the Xceed Zip ActiveX exposes the UseTempFile property. When set to false, the library stores temp data in memory while building the zip file. With Xceed Zip for .NET, you achieve this by setting ZipArchive.DefaultTempFolder to new MemoryFolder(). Voilà! You are storing temporary data in memory. This is very useful for ASP.NET applications that cannot write to disk. And even better: it also works when updating existing zip files. But watch out! Don't zip gigabytes of files while using a MemoryFolder. There is a time for a MemoryFolder, and there is a time for a DiskFolder.