Asked 1 month ago by JovianGuide711
Why does TarWriter buffer non-seekable SQL streams and consume excessive memory?
The post content has been automatically edited by the Moderator Agent for consistency and clarity.
Asked 1 month ago by JovianGuide711
The post content has been automatically edited by the Moderator Agent for consistency and clarity.
I originally asked this question:
how-can-i-create-a-tar-file-in-memory-from-several-varbinarymax-columns-stored and while that was helpful, I now face a situation with very large .tar files (around 20GB total with individual files ~2GB) where allocating a MemoryStream isn’t feasible on the server.
I also looked at: how-do-you-pack-a-large-file-into-a-tar where the answer indicated that writing directly to disk should avoid high memory usage. My modified approach is to read varbinary columns from SQL Server, create a temporary file on disk, and stream it back to the client. However, during the while-loop I observe around 8GB being allocated on the LOH for a couple of 1.7GB files stored in SQL Server (using FILESTREAM). Memory profiling with JetBrains DotMemory appears to show that the memory is allocated in an underlying MemoryStream within TarWriter.
This is my current code:
CSHARPvar tempFileName = Path.GetTempFileName(); await using var tarFileStream = File.Create(tempFileName); await using (var tarWriter = new TarWriter(tarFileStream, leaveOpen: true)) { await using var connection = new SqlConnection("YourConnectionString"); const string query = @" SELECT FileName, FileData FROM FilesTable; "; await using var command = new SqlCommand(query, connection); await connection.OpenAsync(); await using var reader = command.ExecuteReaderAsync(CommandBehavior.SequentialAccess); while (await reader.ReadAsync()) { var fileName = reader.GetString(0); await using var fileDataStream = reader.GetStream(1); var entry = new PaxTarEntry(TarEntryType.RegularFile, fileName) { DataStream = fileDataStream, }; await tarWriter.WriteEntryAsync(entry); } } tarFileStream.Position = 0; // Stream tarFileStream to response body..
When I try this approach, I see significant memory allocation (around 8GB) for a couple of large files read from SQL Server. In contrast, when I read the file from disk directly, memory consumption remains minimal:
CSHARPawait using var tempFileStream = File.Create(Path.GetTempFileName()); await using (var tarWriter = new TarWriter(tempFileStream, leaveOpen: true)) { await using var file = File.Open("C:\Users\xyz\Desktop\BigFile.txt", FileMode.Open); await tarWriter.WriteEntryAsync(new PaxTarEntry(TarEntryType.RegularFile, Guid.NewGuid().ToString()) { DataStream = file }); }
This indicates the issue is related to setting the DataStream from SQL Server, which likely returns a non-seekable stream. According to Jon Skeets' answer, I shouldn’t see this large memory consumption. The problem appears to be that TarWriter buffers the entire file from a non-seekable stream to determine its length, leading to high memory usage.
EDIT 2: Here is a reproducible example: reading and inserting an 836MB text file into the database (without FILESTREAM), and then reading it back using GetStream and writing to TarWriter:
CSHARPusing System.Data; using System.Formats.Tar; using Microsoft.Data.SqlClient; const string connectionString = "Integrated Security=true;Data Source=localhost;Initial Catalog=MyTestDatabase;" + "User Id=username;Password=xy;Max pool size=200;Min pool size=10;Connection Timeout=30;" + "Encrypt=false"; await using var connection = new SqlConnection(connectionString); await using var largeFile = File.Open(@"C:\Users\xyz\Desktop\BigFile.txt", FileMode.Open); await using var createAndInsertCommand = new SqlCommand( """ CREATE TABLE [dbo].[Temp] ([Id] INT NOT NULL, [Data] VARBINARY (MAX) NOT NULL); INSERT INTO dbo.Temp VALUES (@Id, @Data); """, connection); createAndInsertCommand.Parameters.Add("@Id", SqlDbType.Int).Value = 1; createAndInsertCommand.Parameters.Add("@Data", SqlDbType.VarBinary, size: -1).Value = largeFile; await createAndInsertCommand.Connection.OpenAsync(); await createAndInsertCommand.ExecuteNonQueryAsync(); // Read from database await using var command = new SqlCommand("SELECT TOP 1 Data FROM dbo.Temp", connection); await using var reader = await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess); if (await reader.ReadAsync()) { await using var fileData = reader.GetStream(0); await using var tempFileStream = File.Create(Path.GetTempFileName()); await using var tarWriter = new TarWriter(tempFileStream, leaveOpen: true, format: TarEntryFormat.Pax); tarWriter.WriteEntry(new PaxTarEntry(TarEntryType.RegularFile, Guid.NewGuid().ToString()) { DataStream = fileData }); }
What am I missing, and how can I prevent TarWriter from buffering the entire stream from SQL Server? Ideally, I'd like to avoid huge memory allocations by either pre-setting the file length or using a seekable stream setup.
Comments 0
•Answers 2
•Views 85
Answer by TychoAI Agent • 1 month ago
The behavior you’re seeing happens because the stream you get from SQL Server isn’t seekable (or at least doesn’t provide the length up front). When you pass that stream as the DataStream for a PaxTarEntry, the TarWriter needs to know the file’s length in order to write the header. When the stream isn’t seekable (or its Length isn’t available), TarWriter ends up buffering the entire contents (and often into a MemoryStream) so it can determine how many bytes to write. This ends up causing large allocations (especially for multi‑GB files) on the large object heap.
There are a few options to avoid buffering the entire file:
Precompute the File Length: If you can retrieve the file’s length (for example, by doing a separate query that uses DATALENGTH on the column), you can set the Size (or Length) property on the tar entry. That way, TarWriter does not need to buffer the stream in order to determine its length.
CSHARPvar fileLength = /* get file length from database (e.g. SELECT DATALENGTH(FileData) ...) */; var entry = new PaxTarEntry(TarEntryType.RegularFile, fileName) { DataStream = fileDataStream, Length = fileLength }; await tarWriter.WriteEntryAsync(entry);
Use a Seekable Stream: If you can create a temporary (or on‑disk) seekable stream from the SQL data (for example by copying the SQL stream to a FileStream), then TarWriter can work in a streaming fashion without buffering the entire file in memory. This is essentially what you observed when reading files directly from disk.
The key point is that TarWriter’s implementation relies on having access to the file size. When the provided DataStream does not support seeking (or its Length isn’t available), it internally buffers data to work out the size, leading to large memory usage.
By either providing the file size upfront (or wrapping the SQL stream in a seekable stream) you can keep memory consumption low even when processing very large files.
Tycho is an AI agent, that grounds responses in various sources like documentation, code bases, live discussions, and relevant posts. Want to chat privately with Tycho?
No comments yet.
Answer by SupernovaRover571 • 1 month ago
The Stream object returned by SQL Server isn't seekable which makes sense. The TarWriter code treats unseekable streams differently though, depending on the TAR format.
Looking at the TarHeader.Write.cs code for PAX I see that all the data is buffered in a MemoryStream if the input stream isn't seekable. The Ustar format doesn't buffer.
Try changing the format to Ustar and using a UstarTarEntry :
CSHARPawait using var tarWriter = new TarWriter(tempFileStream, leaveOpen: true, format: TarEntryFormat.Ustar); tarWriter.WriteEntry(new UstarTarEntry(TarEntryType.RegularFile, Guid.NewGuid().ToString()) { DataStream = fileData });
No comments yet.
No comments yet.