September 23, 2024
Amadeo Pellicce
The Simplest Possible Data Lake - S3 Hive Parquet & DuckDB Support
We have always been fans of companies like Supabase, which galvanize the developer community with quarterly launch weeks. In our first Launch Week, we will take the opportunity to share some of the things we’ve been working on at Evefan.
We are happy to do so during the DuckDB Small Data Conference in San Francisco and the Cloudflare Birthday Week.
Ingestion AND Querying at Internet Scale
When we launched last month, we focused on making it simple for organizations to privately capture, transform, and deliver customer events at internet scale. This means you can deploy a single instance of Evefan to Cloudflare and avoid standing up many services to accomplish the same thing.
Since then, we’ve had the opportunity to speak with users with diverse data needs. We validated our critical hypothesis: folks found existing solutions too tricky to self-host at scale and had to spend significant time stitching them together. We’ve made that easier by bundling all the key event pipeline and streaming logic into a single JavaScript executable that runs reliably and at scale on Cloudflare Workers.
However, customers told us that even after their data was delivered to its destinations, they still found it difficult to query without bringing in another product.
“So, in short, your product requires me to sign up for another product for it to be useful?”
Today, we’re solving this with the launch of the S3 Merge and Query Engine, Hive Parquet Partitioning, and official DuckDB support.
S3 Destination with Parquet Support
First and foremost, we’ve added a native S3 destination to Evefan. Data is stored in Parquet files constructed with the official Apache Arrow and parquet-wasm bindings. Combined with Evefan’s batching functionality, you can now output event data in the most efficient storage format the big data world has to offer: Parquet.
S3 Merge and Query Engine
Evefan now supports all the S3 query endpoints (GET, HEAD, and LIST) via our v1/s3 handlers. However, instead of simply proxying these requests, we implemented a virtualized, materialized read path that solves the classic ‘too many small files’ problem. Because the data is stored in Hive partitions and served with range-request support, query engines can SELECT large amounts of data directly from object storage.
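For illustration, here is what a Hive-partitioned layout might look like in a bucket (the bucket name, partition keys, and file names below are hypothetical, not Evefan’s exact layout):

```
s3://my-evefan-bucket/events/year=2024/month=09/day=23/part-000.parquet
s3://my-evefan-bucket/events/year=2024/month=09/day=23/part-001.parquet
s3://my-evefan-bucket/events/year=2024/month=09/day=24/part-000.parquet
```

Because the partition values are encoded in the object keys, an engine can prune entire prefixes before reading a single byte, and range requests let it fetch only the Parquet row groups a query actually touches.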
DuckDB Support
You can try it out today using DuckDB. Let’s see an example with Cloudflare’s R2.
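Here is a minimal sketch of the connection setup, assuming DuckDB 0.10+ with the httpfs extension; the credentials and account ID below are placeholders you would substitute with your own:

```sql
-- Load DuckDB's HTTP/S3 support
INSTALL httpfs;
LOAD httpfs;

-- Register R2 credentials as an S3-compatible secret
-- (all values below are placeholders)
CREATE SECRET r2 (
    TYPE S3,
    KEY_ID 'YOUR_R2_ACCESS_KEY_ID',
    SECRET 'YOUR_R2_SECRET_ACCESS_KEY',
    ENDPOINT 'YOUR_ACCOUNT_ID.r2.cloudflarestorage.com',
    URL_STYLE 'path'
);
```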
And subsequently, query your Events data with plain old SQL.
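For example, a query like the following (the bucket, partition keys, and column names are illustrative, not Evefan’s actual schema):

```sql
-- Count events per type for a single day; Hive partitioning lets
-- DuckDB skip every partition outside the WHERE clause
SELECT event_type, count(*) AS total
FROM read_parquet(
    's3://my-evefan-bucket/events/*/*/*/*.parquet',
    hive_partitioning = true
)
WHERE year = 2024 AND month = 9 AND day = 23
GROUP BY event_type
ORDER BY total DESC;
```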
Thanks to DuckDB, this all happens in-process: the query runs straight from your client against your data, with the worker simply organizing the data into a shape the engine can consume efficiently.
Cloudflare’s R2 has no egress fees, making it incredibly cost-effective to store large amounts of data and query it directly from your clients, without intermediary servers.
We’re incredibly excited to release this feature in Alpha mode! As we continue to listen to the community's feedback, expect several improvements in the next couple of weeks.
Visiting DuckDB’s Small Data conf? Say hi, and let's talk about Small Data Pipelines!
Stay tuned for four more exciting announcements this first launch week!
Credits
Thank you to the incredibly supportive community members @kylebarron and @H-Plus-Time for their work on parquet-wasm and their help, and @danthegoodman1 and @jaychia for all the guidance!