Reading/Writing Parquet files

If you have built pyarrow with Parquet support, i.e. parquet-cpp was found during the build, you can read and write files in the Parquet format from and to Arrow memory structures. The Parquet support code is located in the pyarrow.parquet module, and your package needs to be built with the --with-parquet flag for build_ext.

Reading Parquet

To read a Parquet file into Arrow memory, you can use the following code snippet. It reads the whole Parquet file into memory as a Table.

import pyarrow.parquet as pq

table = pq.read_table('<filename>')

Because DataFrames saved as Parquet are often split across multiple files, the convenience method read_multiple_files() is provided.

If you already have the Parquet data available in memory, or you get it from a non-file source, you can use pyarrow.io.BufferReader to read it from memory. As input to the BufferReader you can supply either a Python bytes object or a pyarrow.io.Buffer.

import pyarrow.io as paio
import pyarrow.parquet as pq

buf = ... # either bytes or paio.Buffer
reader = paio.BufferReader(buf)
table = pq.read_table(reader)

Writing Parquet

Given an instance of pyarrow.table.Table, the simplest way to persist it to Parquet is to use the pyarrow.parquet.write_table() function.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table(...)
pq.write_table(table, '<filename>')

By default this writes the Table as a single RowGroup using DICTIONARY encoding. To increase the parallelism with which a query engine can process the Parquet file, set chunk_size to a fraction of the total number of rows so that the data is split across several RowGroups.

If you also want to compress the columns, you can select a compression method using the compression argument. Typically, GZIP is the choice if you want to minimize file size, and SNAPPY if you want better read and write performance.

Instead of writing to a file, you can also write to Python bytes by using a pyarrow.io.InMemoryOutputStream():

import pyarrow.io as paio
import pyarrow.parquet as pq

table = ...
output = paio.InMemoryOutputStream()
pq.write_table(table, output)
pybytes = output.get_result().to_pybytes()