Reading/Writing Parquet files
If you have built pyarrow with Parquet support, i.e. the Parquet C++ library was found during the build, you can read files in the Parquet format to/from Arrow memory structures. The Parquet support code is located in the pyarrow.parquet module and your package needs to be built with the --with-parquet flag for this to work.
To read a Parquet file into Arrow memory, you can use the following code snippet. It will read the whole Parquet file into memory as a pyarrow.Table.
import pyarrow.parquet as pq

table = pq.read_table('<filename>')
As DataFrames stored as Parquet are often spread across multiple files, a convenience method read_multiple_files() is provided.
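A minimal sketch of its use might look as follows; the part file names are hypothetical, and the exact set of keyword arguments read_multiple_files() accepts may differ between versions:

import pyarrow.parquet as pq

# Hypothetical part files that together form one logical DataFrame
paths = ['data/part-0.parquet', 'data/part-1.parquet']

# Reads all files and concatenates them into a single pyarrow.Table
table = pq.read_multiple_files(paths)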
If you already have the Parquet data available in memory or obtain it via a non-file source, you can utilize pyarrow.io.BufferReader to read it from memory. As input to the BufferReader you can supply either a Python bytes object or a pyarrow.io.Buffer.
import pyarrow.io as paio
import pyarrow.parquet as pq

buf = ...  # either bytes or paio.Buffer
reader = paio.BufferReader(buf)
table = pq.read_table(reader)
To write an Arrow Table back to a Parquet file, use write_table():

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table(..)  # an existing pyarrow.Table
pq.write_table(table, '<filename>')
By default this will write the Table as a single RowGroup using dictionary encoding. To increase the potential for parallelism when a query engine processes the Parquet file, set chunk_size to a fraction of the total number of rows.
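As a sketch, assuming table is an existing pyarrow.Table and a target of roughly four row groups (the target count and file name are illustrative), this could look like:

import pyarrow.parquet as pq

table = ...  # an existing pyarrow.Table

# Aim for about four row groups so a query engine can process
# them in parallel; chunk_size is specified in rows
rows_per_group = max(1, table.num_rows // 4)
pq.write_table(table, 'chunked.parquet', chunk_size=rows_per_group)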
If you also want to compress the columns, you can select a compression method using the compression argument. Typically, GZIP is the choice if you want to minimize size and SNAPPY if you want to maximize performance.
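For instance (the file names are illustrative), selecting a codec looks like this; GZIP trades write and read speed for a smaller file, SNAPPY the other way around:

import pyarrow.parquet as pq

table = ...  # an existing pyarrow.Table

# Smallest file, slower to write and read
pq.write_table(table, 'data_gzip.parquet', compression='GZIP')

# Larger file, fast to write and read
pq.write_table(table, 'data_snappy.parquet', compression='SNAPPY')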
Instead of writing to a file, you can also write to Python bytes by utilizing a pyarrow.io.InMemoryOutputStream:
import pyarrow.io as paio
import pyarrow.parquet as pq

table = ...
output = paio.InMemoryOutputStream()
pq.write_table(table, output)
pybytes = output.get_result().to_pybytes()
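Since the result is plain bytes, it can, for example, be handed straight back to pyarrow.io.BufferReader to re-read the Table without touching the filesystem. A small round-trip sketch:

import pyarrow.io as paio
import pyarrow.parquet as pq

table = ...  # an existing pyarrow.Table

# Serialize the Table to Parquet in memory ...
output = paio.InMemoryOutputStream()
pq.write_table(table, output)
pybytes = output.get_result().to_pybytes()

# ... and read it back from those same bytes
roundtrip = pq.read_table(paio.BufferReader(pybytes))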