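Google Cloud Platform (GCP)
Google Cloud Platform's Natural Language API lets users derive insights from unstructured text using Google's machine learning. The procedures in this chapter act as wrappers around calls to this API, extracting entities, categories, or sentiment from text stored in node properties.
Each procedure has two modes:
- Stream - returns a map constructed from the JSON returned by the API
- Graph - creates a graph or virtual graph based on the values returned by the API
Note: The procedures described in this chapter make API calls and subsequently update the database on the calling thread. To make parallel requests to the API, and to avoid out-of-memory errors from holding too much transaction state in memory while running procedures that write to the database, see Batching Requests.
Note: At the moment, the GCP Natural Language API supports text input in more than 10 languages. For better results, make sure that your text is in one of the languages supported by the Natural Language API. If text in an unsupported language is sent to the API, you may get an "HTTP response code: 400" error.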
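Procedure Overview
The procedures are described below:
| Qualified Name | Type |
|---|---|
| apoc.nlp.gcp.entities.stream | Procedure |
| apoc.nlp.gcp.entities.graph | Procedure |
| apoc.nlp.gcp.classify.stream | Procedure |
| apoc.nlp.gcp.classify.graph | Procedure |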
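Entity Extraction
The entity extraction procedures (apoc.nlp.gcp.entities.*) are wrappers around the documents.analyzeEntities method of the Google Natural Language API. This API method finds named entities (currently proper names and common nouns) in the text, along with the entity type, salience, mentions, and other properties of each entity.
The procedures are described below:
| Signature |
|---|
| apoc.nlp.gcp.entities.stream(source :: ANY?, config = {} :: MAP?) :: (node :: NODE?, value :: MAP?, error :: MAP?) |
| apoc.nlp.gcp.entities.graph(source :: ANY?, config = {} :: MAP?) :: (graph :: MAP?) |
The procedures support the following config parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| key | String | null | API key for the Google Natural Language API |
| nodeProperty | String | text | The property on the provided node(s) that contains the unstructured text to be analyzed |
In addition, apoc.nlp.gcp.entities.graph supports the following config parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| scoreCutoff | Double | 0.0 | Lower limit for the salience score of an entity to be present in the graph. The value must be between 0 and 1. Salience indicates the importance or centrality of the entity to the entire document text. Scores closer to 0 are less important, while scores closer to 1.0 are more important. |
| write | Boolean | false | Persist the graph of entities |
| writeRelationshipType | String | ENTITY | Relationship type for relationships from the source node to entity nodes |
| writeRelationshipProperty | String | score | Relationship property for relationships from the source node to entity nodes |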
CALL apoc.nlp.gcp.entities.stream(source:Node or List<Node>, {
key: String,
nodeProperty: String
})
YIELD value
CALL apoc.nlp.gcp.entities.graph(source:Node or List<Node>, {
key: String,
nodeProperty: String,
scoreCutoff: Double,
writeRelationshipType: String,
writeRelationshipProperty: String,
write: Boolean
})
YIELD graph
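The stream signature above also yields node and error alongside value. The following is a minimal sketch of reading all three columns. It assumes nodes labelled Article with a body property (as in the Examples section) and an apiKey parameter. How the error map is populated on failure is not described in this chapter, so the error column is shown only as something to inspect.

```cypher
// Assumption: Article nodes with a `body` property and a $apiKey parameter exist.
MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.gcp.entities.stream(articles, {
  key: $apiKey,
  nodeProperty: "body"
})
YIELD node, value, error
// coalesce guards against a missing entities list when a request did not succeed
RETURN node.uri AS article,
       size(coalesce(value.entities, [])) AS entityCount,
       error;
```

Returning node next to value makes it possible to tell which source node each result row belongs to when a list of nodes is passed in.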
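Classification
The classification procedures (apoc.nlp.gcp.classify.*) are wrappers around the documents.classifyText method of the Google Natural Language API. This API method classifies a document into categories.
The procedures are described below:
| Signature |
|---|
| apoc.nlp.gcp.classify.stream(source :: ANY?, config = {} :: MAP?) :: (node :: NODE?, value :: MAP?, error :: MAP?) |
| apoc.nlp.gcp.classify.graph(source :: ANY?, config = {} :: MAP?) :: (graph :: MAP?) |
The procedures support the following config parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| key | String | null | API key for the Google Natural Language API |
| nodeProperty | String | text | The property on the provided node(s) that contains the unstructured text to be analyzed |
In addition, apoc.nlp.gcp.classify.graph supports the following config parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| scoreCutoff | Double | 0.0 | Lower limit for the confidence score of a category to be present in the graph. The value must be between 0 and 1. Confidence is a number representing how certain the classifier is that this category represents the given text. |
| write | Boolean | false | Persist the graph of categories |
| writeRelationshipType | String | CATEGORY | Relationship type for relationships from the source node to category nodes |
| writeRelationshipProperty | String | score | Relationship property for relationships from the source node to category nodes |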
CALL apoc.nlp.gcp.classify.stream(source:Node or List<Node>, {
key: String,
nodeProperty: String
})
YIELD value
CALL apoc.nlp.gcp.classify.graph(source:Node or List<Node>, {
key: String,
nodeProperty: String,
scoreCutoff: Double,
writeRelationshipType: String,
writeRelationshipProperty: String,
write: Boolean
})
YIELD graph
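Install Dependencies
The NLP procedures depend on Kotlin and client libraries that are not included in the APOC Extended library.
These dependencies are included in apoc-nlp-dependencies-2025.10.0-all.jar, which can be downloaded from the releases page. Once the file is downloaded, place it in the plugins directory and restart the Neo4j server.
Setting up API Key
We can generate an API key that has access to the Cloud Natural Language API by going to console.cloud.google.com/apis/credentials. Once we have created a key, we can populate and execute the following command to create a parameter that contains these details.
The following defines the apiKey parameter:
:param apiKey => ("<api-key-here>")
Alternatively, we can add these credentials to apoc.conf and load them using the static value storage functions.
apoc.static.gcp.apiKey=<api-key-here>
The following retrieves GCP credentials from apoc.conf:
RETURN apoc.static.getAll("gcp") AS gcp;
| gcp |
|---|
| {apiKey: "<api-key-here>"} |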
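The map returned by apoc.static.getAll can be passed straight into a procedure call instead of a query parameter. The following is a minimal sketch, assuming the apoc.static.gcp.apiKey entry shown above is present in apoc.conf and that an Article node with a body property exists (as in the Examples section).

```cypher
// Load the credentials stored under the "gcp" prefix in apoc.conf
WITH apoc.static.getAll("gcp") AS gcp
MATCH (a:Article {uri: "/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.entities.stream(a, {
  key: gcp.apiKey,          // read from apoc.conf rather than from $apiKey
  nodeProperty: "body"
})
YIELD value
UNWIND value.entities AS entity
RETURN entity.name AS name, entity.type AS type;
```

This keeps the key out of the query text and out of client-side parameters, at the cost of requiring access to the server configuration.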
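Batching Requests
Batching requests to the GCP API and processing the results can be done using Periodic Iterate. This approach is useful if we want to make parallel requests to the GCP API and reduce the amount of transaction state kept in memory while running procedures that write to the database.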
CALL apoc.periodic.iterate("
MATCH (n)
WITH collect(n) as total
CALL apoc.coll.partition(total, 25)
YIELD value as nodes
RETURN nodes", "
CALL apoc.nlp.gcp.entities.graph(nodes, {
key: $apiKey,
nodeProperty: 'body',
writeRelationshipType: 'GCP_ENTITY',
write:true
})
YIELD graph
RETURN distinct 'done'", {
batchSize: 1,
params: { apiKey: $apiKey }
}
);
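The same batching pattern should carry over to the classification procedures. The sketch below mirrors the query above under a few assumptions that are not from the original text: it restricts the first statement to nodes labelled Article, and the relationship type GCP_CATEGORY is an illustrative name.

```cypher
CALL apoc.periodic.iterate("
  MATCH (a:Article)
  WITH collect(a) AS total
  // split the collected nodes into lists of at most 25 nodes each
  CALL apoc.coll.partition(total, 25)
  YIELD value AS nodes
  RETURN nodes", "
  CALL apoc.nlp.gcp.classify.graph(nodes, {
    key: $apiKey,
    nodeProperty: 'body',
    writeRelationshipType: 'GCP_CATEGORY',
    write: true
  })
  YIELD graph
  RETURN distinct 'done'", {
    // each row from the first statement is already a list of nodes,
    // so one row per transaction keeps the transaction state small
    batchSize: 1,
    params: { apiKey: $apiKey }
  }
);
```

With batchSize set to 1, each partition of up to 25 nodes is sent to the API and committed in its own transaction.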
Examples
The examples in this section are based on the following sample graph:
CREATE (:Article {
uri: "/blog/pokegraph-gotta-graph-em-all/",
body: "These days I’m rarely more than a few feet away from my Nintendo Switch and I play board games, card games and role playing games with friends at least once or twice a week. I’ve even organised lunch-time Mario Kart 8 tournaments between the Neo4j European offices!"
});
CREATE (:Article {
uri: "https://en.wikipedia.org/wiki/Nintendo_Switch",
body: "The Nintendo Switch is a video game console developed by Nintendo, released worldwide in most regions on March 3, 2017. It is a hybrid console that can be used as a home console and portable device. The Nintendo Switch was unveiled on October 20, 2016. Nintendo offers a Joy-Con Wheel, a small steering wheel-like unit that a Joy-Con can slot into, allowing it to be used for racing games such as Mario Kart 8."
});
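Entity Extraction
Let's start by extracting the entities from the Article node. The text that we want to analyze is stored in the body property of the node, so we need to specify that via the nodeProperty configuration parameter.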
MATCH (a:Article {uri: "/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.entities.stream(a, {
key: $apiKey,
nodeProperty: "body"
})
YIELD value
UNWIND value.entities AS entity
RETURN entity;
| entity |
|---|
| {name: "card games", salience: 0.17967656, metadata: {}, type: "CONSUMER_GOOD", mentions: [{type: "COMMON", text: {content: "card games", beginOffset: -1}}]} |
| {name: "role playing games", salience: 0.16441391, metadata: {}, type: "OTHER", mentions: [{type: "COMMON", text: {content: "role playing games", beginOffset: -1}}]} |
| {name: "Switch", salience: 0.143287, metadata: {}, type: "OTHER", mentions: [{type: "COMMON", text: {content: "Switch", beginOffset: -1}}]} |
| {name: "friends", salience: 0.13336793, metadata: {}, type: "PERSON", mentions: [{type: "COMMON", text: {content: "friends", beginOffset: -1}}]} |
| {name: "Nintendo", salience: 0.12601112, metadata: {mid: "/g/1ymzszlpz"}, type: "ORGANIZATION", mentions: [{type: "PROPER", text: {content: "Nintendo", beginOffset: -1}}]} |
| {name: "board games", salience: 0.08861496, metadata: {}, type: "CONSUMER_GOOD", mentions: [{type: "COMMON", text: {content: "board games", beginOffset: -1}}]} |
| {name: "tournaments", salience: 0.0603245, metadata: {}, type: "EVENT", mentions: [{type: "COMMON", text: {content: "tournaments", beginOffset: -1}}]} |
| {name: "offices", salience: 0.034420907, metadata: {}, type: "LOCATION", mentions: [{type: "COMMON", text: {content: "offices", beginOffset: -1}}]} |
| {name: "Mario Kart 8", salience: 0.029095741, metadata: {wikipedia_url: "https://en.wikipedia.org/wiki/Mario_Kart_8", mid: "/m/0119mf7q"}, type: "PERSON", mentions: [{type: "PROPER", text: {content: "Mario Kart 8", beginOffset: -1}}]} |
| {name: "European", salience: 0.020393685, metadata: {mid: "/m/02j9z", wikipedia_url: "https://en.wikipedia.org/wiki/Europe"}, type: "LOCATION", mentions: [{type: "PROPER", text: {content: "European", beginOffset: -1}}]} |
| {name: "Neo4j", salience: 0.020393685, metadata: {mid: "/m/0b76t3s", wikipedia_url: "https://en.wikipedia.org/wiki/Neo4j"}, type: "ORGANIZATION", mentions: [{type: "PROPER", text: {content: "Neo4j", beginOffset: -1}}]} |
| {name: "8", salience: 0, metadata: {value: "8"}, type: "NUMBER", mentions: [{type: "TYPE_UNKNOWN", text: {content: "8", beginOffset: -1}}]} |
We get back 12 different entities. We could then apply a Cypher statement that creates one node per entity and an ENTITY relationship from the Article node to each of those nodes.
MATCH (a:Article {uri: "/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.entities.stream(a, {
key: $apiKey,
nodeProperty: "body"
})
YIELD value
UNWIND value.entities AS entity
MERGE (e:Entity {name: entity.name})
SET e.type = entity.type
MERGE (a)-[:ENTITY]->(e)
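The manual approach can also reproduce what the graph mode's scoreCutoff and writeRelationshipProperty settings do. The following is a minimal sketch, assuming a threshold of 0.01 and a relationship property named score; both are illustrative choices for this example, not values taken from the procedure itself.

```cypher
MATCH (a:Article {uri: "/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.entities.stream(a, {
  key: $apiKey,
  nodeProperty: "body"
})
YIELD value
UNWIND value.entities AS entity
// keep only entities whose salience reaches the (assumed) threshold
WITH a, entity
WHERE entity.salience >= 0.01
MERGE (e:Entity {name: entity.name})
SET e.type = entity.type
MERGE (a)-[r:ENTITY]->(e)
// store the salience on the relationship so it can be queried later
SET r.score = entity.salience;
```

With the output shown above, only the entity "8" (salience 0) falls below this threshold.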
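Alternatively, we can use the graph mode to automatically create the entity graph. As well as having the Entity label, each entity node will have another label based on the value of the type property. By default, a virtual graph is returned.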
MATCH (a:Article {uri: "/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.entities.graph(a, {
key: $apiKey,
nodeProperty: "body",
writeRelationshipType: "ENTITY"
})
YIELD graph AS g
RETURN g;
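We can see a Neo4j Browser visualization of the virtual graph in Pokemon entities graph.
We can compute the entities for multiple nodes by passing a list of nodes to the procedure.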
MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.gcp.entities.graph(articles, {
key: $apiKey,
nodeProperty: "body",
writeRelationshipType: "ENTITY"
})
YIELD graph AS g
RETURN g;
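We can see a Neo4j Browser visualization of the virtual graph in Pokemon and Nintendo Switch entities graph.
On this visualization we can also see the score for each entity node. This score represents the importance of that entity in the entire document. We can specify a minimum cut-off value for the score using the scoreCutoff property.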
MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.gcp.entities.graph(articles, {
key: $apiKey,
nodeProperty: "body",
writeRelationshipType: "ENTITY",
scoreCutoff: 0.01
})
YIELD graph AS g
RETURN g;
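We can see a Neo4j Browser visualization of the virtual graph in Pokemon and Nintendo Switch entities graph with importance >= 0.01.
If we're happy with this graph and would like to persist it in Neo4j, we can do this by specifying the write: true configuration.
The following creates a HAS_ENTITY relationship from each article to each of its entities: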
MATCH (a:Article)
WITH collect(a) AS articles
CALL apoc.nlp.gcp.entities.graph(articles, {
key: $apiKey,
nodeProperty: "body",
scoreCutoff: 0.01,
writeRelationshipType: "HAS_ENTITY",
writeRelationshipProperty: "gcpEntityScore",
write: true
})
YIELD graph AS g
RETURN g;
We can then write a query to return the entities that have been created.
MATCH (article:Article)
RETURN article.uri AS article,
[(article)-[r:HAS_ENTITY]->(e) | {entity: e.text, score: r.gcpEntityScore}] AS entities;
| article | entities |
|---|---|
| "/blog/pokegraph-gotta-graph-em-all/" | [{score: 0.020393685, entity: "Neo4j"}, {score: 0.034420907, entity: "offices"}, {score: 0.0603245, entity: "tournaments"}, {score: 0.020393685, entity: "European"}, {score: 0.029095741, entity: "Mario Kart 8"}, {score: 0.12601112, entity: "Nintendo"}, {score: 0.13336793, entity: "friends"}, {score: 0.08861496, entity: "board games"}, {score: 0.143287, entity: "Switch"}, {score: 0.16441391, entity: "role playing games"}, {score: 0.17967656, entity: "card games"}] |
| "https://en.wikipedia.org/wiki/Nintendo_Switch" | [{score: 0.76108575, entity: "Nintendo Switch"}, {score: 0.07424594, entity: "Nintendo"}, {score: 0.015900765, entity: "home console"}, {score: 0.012772448, entity: "device"}, {score: 0.038113687, entity: "regions"}, {score: 0.07299799, entity: "Joy-Con Wheel"}] |
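Once the scores are persisted on the relationships, they can be used for filtering and ordering like any other property. The following is a minimal sketch, assuming the HAS_ENTITY relationships and the gcpEntityScore property written by the query above; the 0.1 threshold is an arbitrary illustrative value.

```cypher
MATCH (article:Article)-[r:HAS_ENTITY]->(e)
// keep only the more central entities of each article
WHERE r.gcpEntityScore >= 0.1
RETURN article.uri AS article,
       e.text AS entity,
       r.gcpEntityScore AS score
ORDER BY article, score DESC;
```

This returns one row per article and entity rather than one list per article, which can be easier to sort and aggregate.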
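Classification
Now let's extract categories from the Article node. The text that we want to analyze is stored in the body property of the node, so we need to specify that via the nodeProperty configuration parameter.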
MATCH (a:Article {uri: "/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.classify.stream(a, {
key: $apiKey,
nodeProperty: "body"
})
YIELD value
UNWIND value.categories AS category
RETURN category;
| category |
|---|
| {name: "/Games", confidence: 0.91} |
We get back only one category. We could then apply a Cypher statement that creates one node per category and a CATEGORY relationship from the Article node to each of those nodes.
MATCH (a:Article {uri: "/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.classify.stream(a, {
key: $apiKey,
nodeProperty: "body"
})
YIELD value
UNWIND value.categories AS category
MERGE (c:Category {name: category.name})
MERGE (a)-[:CATEGORY]->(c)
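The statement above discards the confidence value returned for each category. The following is a minimal sketch that keeps it on the relationship, assuming a property name of score (an illustrative choice that matches the default of writeRelationshipProperty in graph mode).

```cypher
MATCH (a:Article {uri: "/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.classify.stream(a, {
  key: $apiKey,
  nodeProperty: "body"
})
YIELD value
UNWIND value.categories AS category
MERGE (c:Category {name: category.name})
MERGE (a)-[r:CATEGORY]->(c)
// record how confident the classifier was about this category
SET r.score = category.confidence;
```

Storing the confidence makes it possible to filter weak classifications later without calling the API again.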
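Alternatively, we can use the graph mode to automatically create the category graph. As well as having the Category label, each category node will have another label based on the value of the type property. By default, a virtual graph is returned.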
MATCH (a:Article {uri: "/blog/pokegraph-gotta-graph-em-all/"})
CALL apoc.nlp.gcp.classify.graph(a, {
key: $apiKey,
nodeProperty: "body",
writeRelationshipType: "CATEGORY"
})
YIELD graph AS g
RETURN g;
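We can see a Neo4j Browser visualization of the virtual graph in Pokemon categories graph.
The following creates a HAS_CATEGORY relationship from each article to each of its categories:
MATCH (a:Article)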
WITH collect(a) AS articles
CALL apoc.nlp.gcp.classify.graph(articles, {
key: $apiKey,
nodeProperty: "body",
writeRelationshipType: "HAS_CATEGORY",
writeRelationshipProperty: "gcpCategoryScore",
write: true
})
YIELD graph AS g
RETURN g;
We can then write a query to return the categories that have been created.
MATCH (article:Article)
RETURN article.uri AS article,
[(article)-[r:HAS_CATEGORY]->(c) | {category: c.text, score: r.gcpCategoryScore}] AS categories;
| article | categories |
|---|---|
| "/blog/pokegraph-gotta-graph-em-all/" | [{category: "/Games", score: 0.91}] |
| "https://en.wikipedia.org/wiki/Nintendo_Switch" | [{category: "/Computers & Electronics/Consumer Electronics/Game Systems & Consoles", score: 0.99}, {category: "/Games/Computer & Video Games", score: 0.99}] |