An extensible, state-of-the-art columnar file format. Formerly at @spiraldb, now a Linux Foundation project (@LFAIDataFdn). Apache-2.0
vortex.dev · GitHub · Joined May 2025
Just announced at Interrupt! SmithDB.
Agent traces have outgrown the databases built to hold them.
That’s why we built SmithDB, a purpose-built distributed database for agent observability.
Read the announcement from Co-Founder @ankush_gola11 → langchain.com/blog/introduci…
We leveraged two amazing open source projects when building SmithDB.
One is @ApacheDataFusio: an extensible, Rust-based query engine. We built custom execution plans specifically tuned for our workloads and storage backend, and DataFusion made it straightforward to plumb everything together.
The other is @vortexdotdev: an extensible file format that allows you to build custom layouts with specific encoding and chunking strategies for different columns.
I would highly recommend checking out both of these projects if you're interested in modern data systems.
We built SmithDB: the database purpose-built for agent observability workloads that now powers many parts of LangSmith.
Agent observability presents a challenging data problem. Agent traces can contain tens of thousands of intermediate spans and large, unbounded payloads. These
The Research Behind Modern Data Compression & @vortexdotdev
When we chose Vortex as the storage layer for Spice Cayenne (the data accelerator engine in Spice), we were betting on decades of database research finally reaching production-ready maturity.
Here's the research behind Vortex:
📄 BtrBlocks (SIGMOD 2023) - The core algorithm from the Technical University of Munich. Cascading multiple lightweight encodings outperforms monolithic compression. Optimize for decompression speed, not just compression ratio.
📄 FastLanes (VLDB 2023) - Hardware-friendly integer compression. Structures bit-packing to maximize SIMD utilization across AVX-512, AVX2, and ARM NEON. Near-memory-bandwidth decompression.
📄 FSST (VLDB 2020) - Fast Static Symbol Table for strings. Near-LZ4 ratios at 5-10× faster decompression. Critical for string-heavy columns.
📄 ALP (CWI Amsterdam) - Adaptive Lossless floating-Point compression. Exploits real-world float patterns (prices with 2 decimals, sensor readings with limited precision).
📄 MonetDB/X100 + Morsel-Driven Parallelism - Foundations for vectorized, NUMA-aware query execution that Vortex builds on.
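To make the "cascading lightweight encodings" idea concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not the BtrBlocks or Vortex implementation: the function names are hypothetical, sizes are rough bit estimates that ignore headers, and the cascade is fixed at two stages (run-length encoding, then frame-of-reference bit-packing of each RLE output).

```python
# Toy two-stage cascade. All names here are hypothetical illustrations,
# not part of BtrBlocks or Vortex.

def run_length_encode(values):
    """Collapse consecutive repeats into parallel (values, lengths) lists."""
    run_values, run_lengths = [], []
    for v in values:
        if run_values and run_values[-1] == v:
            run_lengths[-1] += 1
        else:
            run_values.append(v)
            run_lengths.append(1)
    return run_values, run_lengths

def bits_needed(values):
    """Bits per value when each value is stored as an offset from the minimum."""
    if not values:
        return 0
    span = max(values) - min(values)
    return max(1, span.bit_length())

def bitpacked_size_bits(values):
    """Estimated size of a bit-packed column, ignoring headers."""
    return len(values) * bits_needed(values)

def cascaded_size_bits(values):
    """Stage 1: RLE. Stage 2: bit-pack the run values and run lengths separately."""
    run_values, run_lengths = run_length_encode(values)
    return bitpacked_size_bits(run_values) + bitpacked_size_bits(run_lengths)

# A sorted column with long runs of nearly identical integers.
column = [1000] * 500 + [1001] * 300 + [1007] * 200

plain_bits = len(column) * 64              # uncompressed 64-bit integers
packed_bits = bitpacked_size_bits(column)  # single encoding
cascade_bits = cascaded_size_bits(column)  # RLE, then bit-packing
print(plain_bits, packed_bits, cascade_bits)
```

On this input the cascade comes out far smaller than either stage alone, because bit-packing is applied to three runs instead of a thousand values. Each stage stays cheap to decode.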
The result? Compression that is tailored to your data:
• Integers via FastLanes bit-packing
• Floats via ALP adaptive encoding
• Strings via FSST symbol tables
• Timestamps via delta encoding
• Sorted columns via run-length encoding
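Two of the per-type encodings in the list above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the Vortex or ALP code: the function names are hypothetical, and the float routine is only a simplified take on the decimal-scaling idea behind ALP (the real algorithm also handles exceptions and picks exponents by sampling).

```python
# Hypothetical helper names; a simplified illustration only.

def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def encode_decimal_floats(values, max_exponent=10):
    """Find the smallest power of ten that turns every float into an integer
    losslessly. Returns (exponent, integers), or None if no exponent works."""
    for exponent in range(max_exponent + 1):
        factor = 10 ** exponent
        ints = [round(v * factor) for v in values]
        # Accept the exponent only if decoding reproduces every float exactly.
        if all(i / factor == v for i, v in zip(ints, values)):
            return exponent, ints
    return None

def decode_decimal_floats(exponent, ints):
    """Invert encode_decimal_floats."""
    factor = 10 ** exponent
    return [i / factor for i in ints]

# Millisecond timestamps one second apart: the deltas are tiny compared to the raw values.
timestamps = [1_700_000_000_000 + 1000 * i for i in range(5)]
deltas = delta_encode(timestamps)

# Prices with two decimals become small integers, which can then be bit-packed.
prices = [19.99, 5.25, 100.10]
encoded_prices = encode_decimal_floats(prices)
```

In both cases the output is a column of small integers, which is what lets an integer bit-packing scheme do the final size reduction.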
Why does this matter for production systems?
1️⃣ Query performance scales with decompression speed. A focus on decode performance translates directly to faster queries.
2️⃣ Automatic encoding selection means zero configuration. The algorithm samples your data and picks optimal strategies per column.
3️⃣ SIMD acceleration is baked in. FastLanes was designed for vectorized, hardware-accelerated execution from day one.
4️⃣ Zero-copy Arrow access. Data decompresses directly to Arrow arrays with no intermediate copies.
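The sampling-based selection in point 2 could look roughly like the Python sketch below. It is a guess at the general shape, not Vortex's actual sampler or cost model: the chunk sizes, cost formulas, and every function name are assumptions made for illustration.

```python
# Hypothetical sketch of sample-then-choose encoding selection.

def width(span):
    """Bits needed to store offsets in the range [0, span]."""
    return max(1, span.bit_length())

def cost_plain(chunk):
    return 64 * len(chunk)

def cost_bit_packed(chunk):
    # 64-bit reference value plus fixed-width offsets from the minimum.
    return 64 + len(chunk) * width(max(chunk) - min(chunk))

def cost_rle(chunk):
    # Each run stores a 64-bit value and a 32-bit length.
    runs = 1 + sum(1 for a, b in zip(chunk, chunk[1:]) if a != b)
    return runs * (64 + 32)

def cost_delta(chunk):
    deltas = [b - a for a, b in zip(chunk, chunk[1:])]
    if not deltas:
        return 64
    # First value, minimum delta as reference, then fixed-width delta offsets.
    return 64 + 64 + len(deltas) * width(max(deltas) - min(deltas))

ESTIMATORS = {
    "plain": cost_plain,
    "bit_packed": cost_bit_packed,
    "rle": cost_rle,
    "delta": cost_delta,
}

def sample_chunks(values, chunk=64, chunks=8):
    """Take a few contiguous chunks so run and delta patterns survive sampling."""
    if len(values) <= chunk * chunks:
        return [list(values)]
    stride = len(values) // chunks
    return [values[i * stride : i * stride + chunk] for i in range(chunks)]

def choose_encoding(values):
    """Return the name of the encoding with the lowest estimated size on the sample."""
    if not values:
        return "plain"
    sampled = sample_chunks(values)
    costs = {name: sum(fn(c) for c in sampled) for name, fn in ESTIMATORS.items()}
    return min(costs, key=costs.get)

sorted_runs = [1] * 1000 + [2] * 1000 + [3] * 1000
timestamps = [1_700_000_000_000 + 1000 * i for i in range(3000)]
small_ints = [(i * 37) % 100 for i in range(3000)]

print(choose_encoding(sorted_runs), choose_encoding(timestamps), choose_encoding(small_ints))
```

Costs are estimated per contiguous chunk so that chunk boundaries do not create artificial runs or deltas. Only the sample is scanned, which keeps the selection step cheap relative to encoding the full column.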
Vortex is now a Linux Foundation AI & Data project, and researchers are building on it (AnyBlox, F3). You get SOTA research in production systems.
The future of data storage is exciting.
To learn more about our Vortex implementation, check out the blog: hubs.ly/Q04bGfvf0 #datafusion #ai #data #vortex #spiceai #arrow #parquet
you took up with Weasley, but he can't afford sliceable cascaded encodings.
now your random access is dogged, and your cortisol is properly spiked, potter
DuckDB now supports reading from and writing to the Vortex file format! The DuckDB Labs and Spiral teams have worked together to make Vortex available as a core extension in DuckDB.
Vortex is an open source, columnar file format whose design is heavily influenced by recent
🌪️ Why LF Vortex for hot data?
@ApacheParquet great compression, slow decode
@ApacheArrow instant decode, no compression
Vortex: encoding-efficient compression with SIMD decode to Arrow
80% of Parquet's compression, 10x faster decode
Happy to share that I've been nominated to the @vortexdotdev Technical Steering Committee! It's been fun and productive switching to Vortex from Parquet as our storage format at Polar Signals and I'm excited to continue contributing to the Vortex project.
Super cool, they forked @DeltaLakeOSS to replace Parquet (for data) with Vortex and JSON (for metadata) with Vortex. Huge performance gains!
Maybe we should upstream this one 😁 @vortexdotdev
🧊 New on the Polar Signals Blog — Our Delta Lake Fork
Purpose-built for our continuous profiling product. In our latest post, we walk through how Delta Lake works, and the changes we've made to improve performance for our product.
👉 Read the full post: buff.ly/KwHINtO
We completed a major project to switch our storage file format from Parquet to Vortex 🌪️ resulting in 70% average query performance improvement across the board 🚀
Learn more about how rethinking interface-imposed limitations unlocked these gains in our latest blog post 👇
The talk on @SpiralDB at @CMUDB: youtube.com/watch?v=zyn_T5… is a great one.
I think it would also be interesting to hear a counterpoint about @ApacheParquet that explains the actual technical details of that format, the Cathedral vs. Bazaar management, options with metadata, etc.