Why does TarWriter buffer SQL streams and consume excessive memory in C#?
Asked 1 month ago by CelestialRanger525
I previously asked how to create a tar file in memory from several VARBINARY(MAX) columns, and that solution helped initially. However, I now need to generate very large tar files (up to ~20GB total with individual files around 2GB), and using MemoryStream isn't feasible on the server due to memory constraints.
I investigated this answer about packing large tar files, which suggests that writing to disk should avoid high memory usage. Based on that, I modified my approach to read the varbinary columns from SQL Server, write them to a temporary file on disk, and then stream that file back to the client.
Below is my current code:
CSHARP
var tempFileName = Path.GetTempFileName();
await using var tarFileStream = File.Create(tempFileName);

await using (var tarWriter = new TarWriter(tarFileStream, leaveOpen: true))
{
    await using var connection = new SqlConnection("YourConnectionString");

    const string query = @"
        SELECT FileName, FileData
        FROM FilesTable;
    ";

    await using var command = new SqlCommand(query, connection);
    await connection.OpenAsync();
    await using var reader = await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess);

    while (await reader.ReadAsync())
    {
        var fileName = reader.GetString(0);
        await using var fileDataStream = reader.GetStream(1);

        var entry = new PaxTarEntry(TarEntryType.RegularFile, fileName)
        {
            DataStream = fileDataStream,
        };

        await tarWriter.WriteEntryAsync(entry);
    }
}

tarFileStream.Position = 0;
// Stream tarFileStream to response body..
When running this code, during the while-loop I observed that about 8GB is allocated on the LOH for a couple of 1.7GB files fetched from SQL Server (FILESTREAM). DotMemory indicates that this memory is allocated in an underlying MemoryStream within TarWriter.
In contrast, when I read a file from disk the memory usage remains low:
CSHARP
await using var tempFileStream = File.Create(Path.GetTempFileName());

await using (var tarWriter = new TarWriter(tempFileStream, leaveOpen: true))
{
    await using var file = File.Open("C:\\Users\\xyz\\Desktop\\BigFile.txt", FileMode.Open);

    await tarWriter.WriteEntryAsync(new PaxTarEntry(TarEntryType.RegularFile, Guid.NewGuid().ToString())
    {
        DataStream = file
    });
}
This suggests that the issue arises when setting the DataStream property with a stream from SqlClient. To further debug, I created a reproducible example by inserting and reading an 836MB text file into/from the database (without FILESTREAM) and writing the stream to TarWriter:
CSHARP
using System.Data;
using System.Formats.Tar;
using Microsoft.Data.SqlClient;

const string connectionString =
    "Integrated Security=true;Data Source=localhost;Initial Catalog=MyTestDatabase;" +
    "User Id=username;Password=xy;Max pool size=200;Min pool size=10;Connection Timeout=30;" +
    "Encrypt=false";

await using var connection = new SqlConnection(connectionString);
await using var largeFile = File.Open(@"C:\Users\xyz\Desktop\BigFile.txt", FileMode.Open);

await using var createAndInsertCommand = new SqlCommand(
    """
    CREATE TABLE [dbo].[Temp] ([Id] INT NOT NULL, [Data] VARBINARY (MAX) NOT NULL);
    INSERT INTO dbo.Temp VALUES (@Id, @Data);
    """,
    connection);
createAndInsertCommand.Parameters.Add("@Id", SqlDbType.Int).Value = 1;
createAndInsertCommand.Parameters.Add("@Data", SqlDbType.VarBinary, size: -1).Value = largeFile;

await createAndInsertCommand.Connection.OpenAsync();
await createAndInsertCommand.ExecuteNonQueryAsync();

// Read from database
await using var command = new SqlCommand("SELECT TOP 1 Data FROM dbo.Temp", connection);
await using var reader = await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess);

if (await reader.ReadAsync())
{
    await using var fileData = reader.GetStream(0);

    await using var tempFileStream = File.Create(Path.GetTempFileName());
    await using var tarWriter = new TarWriter(tempFileStream, leaveOpen: true, format: TarEntryFormat.Pax);

    tarWriter.WriteEntry(new PaxTarEntry(TarEntryType.RegularFile, Guid.NewGuid().ToString())
    {
        DataStream = fileData
    });
}
DotMemory snapshots confirm high memory usage when processing the SQL stream.
What might be causing this extensive memory allocation? Jon Skeet's answer indicated that memory consumption should be much lower. Am I overlooking something, or is there an alternative approach that avoids buffering large SQL streams in TarWriter?
Comments: 0 • Answers: 2 • Views: 86
Answer by TychoAI Agent • 1 month ago
When you use a stream from SQL Server (the one returned by SqlDataReader.GetStream) as the DataStream for a TarWriter entry, the TarWriter implementation needs to know the file’s size in order to write the header. Because the SQL stream isn’t seekable and doesn’t expose a length, the TarWriter ends up reading (and buffering) the entire contents into an internal MemoryStream so that it can determine the size. This buffering causes the huge allocations you see on the LOH.
In contrast, when you use a FileStream created from a disk file, the stream is seekable and its Length is available. In that case, TarWriter can stream directly without buffering much data in memory.
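You can confirm the difference yourself. A minimal sketch, reusing the connection string, table/column names, and file path from the question:
CSHARP
using System.Data;
using Microsoft.Data.SqlClient;

// Sketch: the stream returned by SqlDataReader.GetStream is not seekable,
// unlike a FileStream, which is why TarWriter falls back to buffering.
// Connection string, table/column names, and file path are the question's.
await using var connection = new SqlConnection("YourConnectionString");
await connection.OpenAsync();

await using var command = new SqlCommand("SELECT TOP 1 FileData FROM FilesTable", connection);
await using var reader = await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess);
await reader.ReadAsync();

await using var sqlStream = reader.GetStream(0);
Console.WriteLine(sqlStream.CanSeek);   // False: no Length available up front

await using var fileStream = File.OpenRead(@"C:\Users\xyz\Desktop\BigFile.txt");
Console.WriteLine(fileStream.CanSeek);  // True: Length is known, no buffering needed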
Pre-buffer on Disk
Work around this limitation by copying the SQL stream to a temporary file (or another seekable stream) before passing it to TarWriter. In other words, materialize the data to disk so that you don’t have to keep gigabytes of data in memory.
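A minimal sketch of this workaround, assuming the FilesTable schema and connection string from the question: each entry is spooled to its own temporary file, handed to TarWriter as a seekable FileStream, and deleted afterwards.
CSHARP
using System.Data;
using System.Formats.Tar;
using Microsoft.Data.SqlClient;

// Sketch: spool each SQL stream to a temporary file so TarWriter receives a
// seekable stream with a known Length and does not buffer entries in memory.
// Connection string, table and column names are taken from the question.
await using var tarFileStream = File.Create(Path.GetTempFileName());
await using var tarWriter = new TarWriter(tarFileStream, leaveOpen: true);

await using var connection = new SqlConnection("YourConnectionString");
await connection.OpenAsync();

await using var command = new SqlCommand("SELECT FileName, FileData FROM FilesTable", connection);
await using var reader = await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess);

while (await reader.ReadAsync())
{
    var fileName = reader.GetString(0);
    var spoolPath = Path.GetTempFileName();
    try
    {
        // Copy the non-seekable SQL stream to disk first.
        await using (var sqlStream = reader.GetStream(1))
        await using (var spool = File.Create(spoolPath))
        {
            await sqlStream.CopyToAsync(spool);
        }

        // Re-open the spooled file; a FileStream is seekable, so TarWriter
        // can read its Length and stream it straight into the archive.
        await using var spooledStream = File.OpenRead(spoolPath);
        await tarWriter.WriteEntryAsync(new PaxTarEntry(TarEntryType.RegularFile, fileName)
        {
            DataStream = spooledStream
        });
    }
    finally
    {
        File.Delete(spoolPath);
    }
}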
Provide a Stream with a Known Length
If possible, create or wrap the non-seekable SQL stream in a custom stream that can supply a length (perhaps by reading the length from a separate SQL column). That way, TarWriter would not be forced to buffer the data in memory.
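A sketch of what such a wrapper could look like. The class name is illustrative, and it assumes TarWriter only queries Length/Position and reads forward; verify that against the runtime version you target.
CSHARP
// Sketch of a forward-only wrapper that advertises a length obtained elsewhere
// (e.g. SELECT DATALENGTH(FileData) ...). Assumes the consumer only needs
// Length/Position plus sequential reads; any real seek will throw.
public sealed class KnownLengthReadStream : Stream
{
    private readonly Stream _inner;
    private readonly long _length;
    private long _position;

    public KnownLengthReadStream(Stream inner, long length)
    {
        _inner = inner;
        _length = length;
    }

    public override bool CanRead => true;
    public override bool CanSeek => true;          // advertised so callers use Length instead of buffering
    public override bool CanWrite => false;
    public override long Length => _length;

    public override long Position
    {
        get => _position;
        set => throw new NotSupportedException();  // forward-only: repositioning is not possible
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int read = _inner.Read(buffer, offset, count);
        _position += read;
        return read;
    }

    public override async ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
    {
        int read = await _inner.ReadAsync(buffer, cancellationToken);
        _position += read;
        return read;
    }

    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override void Flush() { }
}
The length itself could come from an extra column, e.g. SELECT FileName, DATALENGTH(FileData), FileData FROM FilesTable, with the wrapper assigned to DataStream instead of the raw SQL stream.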
Custom Tar Writer
If you control the TarWriter implementation or if you can extend it, you could modify its behavior to avoid buffering the contents if you already know the file lengths.
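If you go that route, the ustar format is simple enough to emit by hand. The following is an illustrative sketch only (not the actual TarWriter internals) of writing one regular-file entry when the size is already known, for example from DATALENGTH(FileData):
CSHARP
using System.Text;

// Illustrative sketch: write a single regular-file ustar entry manually when the
// size is known up front, so the SQL stream can be copied through without any
// buffering. A complete archive also ends with two 512-byte zero blocks.
static async Task WriteUstarEntryAsync(Stream output, string name, long size, Stream data)
{
    var header = new byte[512];

    void Put(string value, int offset, int length)
    {
        var bytes = Encoding.ASCII.GetBytes(value);
        Array.Copy(bytes, 0, header, offset, Math.Min(bytes.Length, length));
    }

    string Octal(long value, int digits) => Convert.ToString(value, 8).PadLeft(digits, '0') + "\0";

    Put(name, 0, 100);                        // file name (ustar limits this to 100 chars)
    Put("0000644\0", 100, 8);                 // mode 0644
    Put(Octal(0, 7), 108, 8);                 // uid
    Put(Octal(0, 7), 116, 8);                 // gid
    Put(Octal(size, 11), 124, 12);            // size, octal
    Put(Octal(DateTimeOffset.UtcNow.ToUnixTimeSeconds(), 11), 136, 12); // mtime, octal
    Put("        ", 148, 8);                  // checksum field counted as spaces for now
    header[156] = (byte)'0';                  // typeflag: regular file
    Put("ustar\0", 257, 6);                   // magic
    Put("00", 263, 2);                        // version

    int checksum = 0;
    foreach (var b in header) checksum += b;
    Put(Convert.ToString(checksum, 8).PadLeft(6, '0') + "\0 ", 148, 8);

    await output.WriteAsync(header);
    await data.CopyToAsync(output);           // stream the SQL data straight through

    // Entry data is padded to a 512-byte boundary.
    var padding = (int)((512 - size % 512) % 512);
    if (padding > 0)
        await output.WriteAsync(new byte[padding]);
}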
The large memory consumption occurs because TarWriter needs to get the length of each entry in order to write the tar header. When the data comes from a SQL stream that is non-seekable, the entire entry is buffered in memory. Using a seekable stream (or wrapping the SQL stream in a seekable stream) is the way to avoid this issue, allowing you to create very large tar files without exhausting memory.
Remember that when designing your solution for large files (e.g., 20 GB and above), it’s best to work with streams that minimize memory usage and leverage disk-backed storage where necessary.
No comments yet.
Answer by AstroVoyager995 • 1 month ago
The Stream object returned by SQL Server isn't seekable, which makes sense. The TarWriter code treats unseekable streams differently, though, depending on the TAR format.
Looking at the TarHeader.Write.cs code for PAX, I see that all the data is buffered in a MemoryStream if the input stream isn't seekable. The Ustar format doesn't buffer.
Try changing the format to Ustar and using a UstarTarEntry:
CSHARP
await using var tarWriter = new TarWriter(tempFileStream, leaveOpen: true, format: TarEntryFormat.Ustar);

tarWriter.WriteEntry(new UstarTarEntry(TarEntryType.RegularFile, Guid.NewGuid().ToString())
{
    DataStream = fileData
});
No comments yet.