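Export Apache Parquet - APOC Extended Documentation

Library Requirements

The Apache Parquet procedures depend on a client library that is not included in the APOC Extended library.

These dependencies are included in apoc-hadoop-dependencies-2025.10.0-all.jar, which can be downloaded from the releases page.

Once the file is downloaded, place it in the plugins directory and restart the Neo4j server.

Available Procedures

The table below describes the available procedures: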

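Name	Description
apoc.export.parquet.all	Exports the full database as a Parquet file
apoc.export.parquet.data	Exports the given nodes and relationships as a Parquet file
apoc.export.parquet.graph	Exports the given graph as a Parquet file
apoc.export.parquet.query	Exports the results of the given Cypher query as a Parquet file
apoc.export.parquet.all.stream	Exports the full database as a Parquet byte array
apoc.export.parquet.data.stream	Exports the given nodes and relationships as a Parquet byte array
apoc.export.parquet.graph.stream	Exports the given graph as a Parquet byte array
apoc.export.parquet.query.stream	Exports the results of the given Cypher query as a Parquet byte array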
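
To see which of these procedures are registered on a given server, a query along the following lines may help. This is a minimal sketch that assumes Neo4j 4.3 or later, where SHOW PROCEDURES is available. It only confirms that the procedures are registered; it does not check that the dependency jar was loaded.

```cypher
// List the Parquet export procedures registered on this server.
SHOW PROCEDURES
YIELD name, description
WHERE name STARTS WITH 'apoc.export.parquet'
RETURN name, description
ORDER BY name;
```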

The exported results can be imported or loaded by using one of the corresponding Parquet import or load procedures.
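
As an illustration of reading an export back, the sketch below assumes that the apoc.load.parquet procedure from APOC Extended is installed, that it accepts the file name as its first argument, and that it yields one value map per row. The file name test.parquet is only a placeholder for a previously exported file; check the load and import documentation for the exact signatures.

```cypher
// Assumption: apoc.load.parquet(file) yields one `value` map per exported row.
// 'test.parquet' stands for a file produced by one of the export procedures.
CALL apoc.load.parquet('test.parquet')
YIELD value
RETURN value
LIMIT 5;
```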

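Configuration parameters

The procedures support the following config parameters:

Table 1. Config parameters
name	type	default	description
batchSize	long	20000	used to update the Parquet file / byte array every n results
mapping	Map	{}	used to map complex files. See the `Mapping config` section below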
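
A config parameter such as batchSize might be supplied as follows. This is a sketch that assumes the config map is passed as the last argument, after the file name; the value 100 is arbitrary and chosen only for illustration.

```cypher
// Assumption: the config map is the final argument of the procedure.
// batchSize: 100 asks for the file to be updated every 100 results.
CALL apoc.export.parquet.all('test.parquet', {batchSize: 100})
YIELD file, batchSize, batches
RETURN file, batchSize, batches;
```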

Usage

The examples in this section are based on the following sample graph:

CREATE (TheMatrix:Movie {title:'The Matrix', released:1999, tagline:'Welcome to the Real World'})
CREATE (Keanu:Person {name:'Keanu Reeves', born:1964})
CREATE (Carrie:Person {name:'Carrie-Anne Moss', born:1967})
CREATE (Laurence:Person {name:'Laurence Fishburne', born:1961})
CREATE (Hugo:Person {name:'Hugo Weaving', born:1960})
CREATE (LillyW:Person {name:'Lilly Wachowski', born:1967})
CREATE (LanaW:Person {name:'Lana Wachowski', born:1965})
CREATE (JoelS:Person {name:'Joel Silver', born:1952})
CREATE
(Keanu)-[:ACTED_IN {roles:['Neo']}]->(TheMatrix),
(Carrie)-[:ACTED_IN {roles:['Trinity']}]->(TheMatrix),
(Laurence)-[:ACTED_IN {roles:['Morpheus']}]->(TheMatrix),
(Hugo)-[:ACTED_IN {roles:['Agent Smith']}]->(TheMatrix),
(LillyW)-[:DIRECTED]->(TheMatrix),
(LanaW)-[:DIRECTED]->(TheMatrix),
(JoelS)-[:PRODUCED]->(TheMatrix);

The following query exports the whole database to the Parquet file test.parquet:

CALL apoc.export.parquet.all('test.parquet')

Table 2. Results
file	source	format	nodes	relationships	properties	time	rows	batchSize	batches	data
"file:///test.parquet"	"graph: nodes(8), rels(7)"	"parquet"	8	7	0	0	0	20000	0	null

The following query exports the given nodes and relationships to the Parquet file testData.parquet:

MATCH (n:Person)-[r]->()
WITH collect(n) AS nodes, collect(r) AS rels
CALL apoc.export.parquet.data(nodes, rels, 'testData.parquet')
YIELD file RETURN file

Table 3. Results
file
"file:///testData.parquet"

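The same procedure can be given a narrower selection. The sketch below, based on the sample graph, exports only the ACTED_IN relationships together with the nodes at both ends; the file name actedIn.parquet is a hypothetical choice.

```cypher
// Collect the actors, the movies they acted in, and the connecting relationships.
MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
WITH collect(DISTINCT p) + collect(DISTINCT m) AS nodes, collect(r) AS rels
// 'actedIn.parquet' is an illustrative file name, not one used by the documentation.
CALL apoc.export.parquet.data(nodes, rels, 'actedIn.parquet')
YIELD file
RETURN file;
```
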
The following query exports the given graph to the Parquet file testGraph.parquet:

CALL apoc.graph.fromDB('neo4j',{}) YIELD graph
CALL apoc.export.parquet.graph(graph, 'testGraph.parquet')
YIELD file RETURN file

Table 4. Results
file
"file:///testGraph.parquet"

The following query exports the results of the given Cypher query to the Parquet file testQuery.parquet:

CALL apoc.export.parquet.query("MATCH (n:Person) RETURN n", 'testQuery.parquet')

Table 5. Results
file	source	format	nodes	relationships	properties	time	rows	batchSize	batches	data
"file:///testQuery.parquet"	"statement: cols(1)"	"parquet"	8	7	0	0	0	20000	0	null

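A query export can also return plain columns instead of whole nodes. The following sketch, based on the sample graph, exports one row per actor and movie pair; the file name actors.parquet and the column aliases are hypothetical, and it assumes that file export is enabled for APOC on the server.

```cypher
// Export scalar columns rather than node values.
// 'actors.parquet' and the aliases actor, born, movie are illustrative names.
CALL apoc.export.parquet.query(
  "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name AS actor, p.born AS born, m.title AS movie",
  'actors.parquet'
)
YIELD file, source, rows
RETURN file, source, rows;
```
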
We can also use the apoc.export.parquet.<type>.stream procedures to return the Parquet byte array directly as the result, for example:

CALL apoc.export.parquet.all.stream

Table 6. Results
value
<byte_array_parquet_file>
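
The other stream variants can be used in a similar way. The sketch below assumes that apoc.export.parquet.query.stream takes the same arguments as apoc.export.parquet.query minus the file name, and that it yields the byte array in a column named value, as in the table above.

```cypher
// Assumption: the .stream variant takes the query (and an optional config map)
// but no file name, and returns the Parquet bytes in the `value` column.
CALL apoc.export.parquet.query.stream(
  "MATCH (p:Person) RETURN p.name AS name, p.born AS born"
)
YIELD value
RETURN value;
```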