I'm trying to work with data stored in a Snowflake database using polars in Python. I see I can access the data using pl.read_database_uri with the adbc engine, and I'm wondering how to do this efficiently for larger-than-memory datasets.
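For reference, this is roughly what I'm doing now (the URI and table name are placeholders):

```python
import polars as pl

# Placeholder URI; the real one carries my Snowflake account/credentials.
uri = "snowflake://user:password@account/database/schema?warehouse=wh"

df = pl.read_database_uri(
    query="SELECT * FROM my_large_table",  # hypothetical table
    uri=uri,
    engine="adbc",
)  # materializes the full result in memory
```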
Is it possible to stream the results using polars' lazy API, or any other method?
Is it possible to batch the results as pl.read_database can? Or is it possible to partition the results, as the docs say is possible with connectorx?
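By batching I mean something like the iter_batches option on pl.read_database, sketched here with a hypothetical ADBC connection (the connection URI is a placeholder, and whether batch_size is honored depends on the driver):

```python
import adbc_driver_snowflake.dbapi
import polars as pl

# Hypothetical connection; the URI format depends on your Snowflake setup.
conn = adbc_driver_snowflake.dbapi.connect("user:password@account/database/schema")

for batch in pl.read_database(
    query="SELECT * FROM my_large_table",
    connection=conn,
    iter_batches=True,
    batch_size=100_000,
):
    ...  # each batch is a polars DataFrame
```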
Are there any other ways I might use polars to help work with larger-than-memory datasets in this instance? Or do I need to do my processing in SQL so that the data comes into python in a manageable size?
Thanks!
1 Answer
As of polars==1.25.2 there's not an easy way to do this.
One way I've approached this problem is to use the Snowflake Connector for Python to retrieve batches of a query result iteratively and process each batch with polars.
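In outline, something like this (connection parameters and the query are placeholders):

```python
import polars as pl
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",  # placeholders
    user="my_user",
    password="my_password",
)
cur = conn.cursor()
cur.execute("SELECT * FROM my_large_table")

for arrow_table in cur.fetch_arrow_batches():
    # Each batch arrives as a pyarrow.Table; convert and process it with
    # polars, then let it go out of scope before fetching the next one.
    df = pl.from_arrow(arrow_table)
    ...  # per-batch processing
```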
But I encountered some surprising Snowflake Connector behavior when doing this:
- Queries that produce zero rows return None rather than an empty table. This can be worked around with fetch_arrow_all(..., force_return_table=True) (see the connector docs).
- When using fetch_arrow_batches(), column datatypes can vary among batches, so each batch may need to be cast to a common schema before concatenating (see the sketch below).
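Here's a sketch of how both issues can be handled; the target schema is illustrative (in practice, derive it from the table definition or a representative batch), and cur is the cursor from the sketch above:

```python
import polars as pl

# Illustrative target schema; derive the real one from the table's DDL.
SCHEMA = {"id": pl.Int64, "amount": pl.Float64}

cur.execute("SELECT id, amount FROM my_large_table")
for arrow_table in cur.fetch_arrow_batches():
    # Cast every batch to the same dtypes so downstream code (and any
    # eventual pl.concat) sees a consistent schema.
    df = pl.from_arrow(arrow_table).cast(SCHEMA)
    ...  # per-batch processing

# Zero-row queries: force an empty Arrow table instead of None.
cur.execute("SELECT id, amount FROM my_large_table WHERE 1 = 0")
empty = pl.from_arrow(cur.fetch_arrow_all(force_return_table=True))
```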