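Machine learning pipelines

The Python client has special support for link prediction pipelines and node property prediction pipelines. In the GDS Python client, GDS pipelines are represented as pipeline objects.

Operating pipelines through the client is based entirely on these pipeline objects. This is a more convenient and Pythonic API than the Cypher procedure API. Once created, a pipeline object can be passed as an argument to various methods in the Python client, such as the pipeline catalog operations. Pipeline objects also have convenience methods for inspecting the pipeline they represent without explicitly involving the pipeline catalog.

In the examples below we assume that we have an instantiated GraphDataScience object called gds. Read more about this in the Getting started guide.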

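1. Node classification

This section outlines how to use the Python client to build, configure and train a node classification pipeline, as well as how to use the model that training produces for predictions.

1.1. Pipeline

To create a new node classification pipeline, one would make the following call: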

pipe = gds.nc_pipe("my-pipe")

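where pipe is a pipeline object.

To then build, configure and train the pipeline, we call methods directly on the node classification pipeline object. The methods of such objects are described below: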

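Table 1. Node classification pipeline methods
Name	Arguments	Return type	Description
`addNodeProperty`	`procedure_name: str, config: **kwargs`	`Series`	Adds an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration.
`selectFeatures`	`node_properties: Union[str, list[str]]`	`Series`	Selects the node properties to use as features.
`configureSplit`	`config: **kwargs`	`Series`	Configures the train-test dataset split.
`addLogisticRegression`	`parameter_space: dict[str, any]`	`Series`	Adds a logistic regression model configuration to train as a candidate in the model selection phase. ^[1]
`addRandomForest`	`parameter_space: dict[str, any]`	`Series`	Adds a random forest model configuration to train as a candidate in the model selection phase. ^[1]
`addMLP`	`parameter_space: dict[str, any]`	`Series`	Adds an MLP model configuration to train as a candidate in the model selection phase. ^[1]
`configureAutoTuning`	`config: **kwargs`	`Series`	Configures the auto-tuning.
`train`	`G: Graph, config: **kwargs`	`NCModel, Series`	Trains the pipeline on the given input graph using the given keyword arguments.
`train_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimates the resources required to train the pipeline on the given input graph.
`feature_properties`	`-`	`Series`	Returns a list of the selected feature properties for the pipeline.
`exists`	`-`	`bool`	`True` if the pipeline exists in the GDS pipeline catalog, `False` otherwise.
`name`	`-`	`str`	The name of the pipeline as it appears in the pipeline catalog.
`type`	`-`	`str`	The type of pipeline.
`creation_time`	`-`	`neo4j.time.Datetime`	The time the pipeline was created.
`node_property_steps`	`-`	`DataFrame`	Returns the node property steps of the pipeline.
`split_config`	`-`	`Series`	Returns the configuration set up for feature-train-test splitting of the dataset.
`parameter_space`	`-`	`Series`	Returns the model parameter space set up for model selection when training.
`auto_tuning_config`	`-`	`Series`	Returns the configuration set up for auto-tuning.
`drop`	`failIfMissing: Optional[bool]`	`Series`	Removes the pipeline from the GDS pipeline catalog.
1. Ranges can also be given as length-two `Tuple`s. For example, `(x, y)` is the same as `{range: [x, y]}`.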
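
The tuple shorthand from the footnote can be pictured with a small stand-alone sketch. The helper `normalize_parameter_space` is hypothetical and not part of the GDS client. It only illustrates the stated equivalence between `(x, y)` and a `range` map, under the assumption that every other value passes through unchanged.

```python
from typing import Any, Dict


def normalize_parameter_space(parameter_space: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite length-two tuples as {"range": [x, y]} maps (hypothetical helper)."""
    normalized: Dict[str, Any] = {}
    for key, value in parameter_space.items():
        if isinstance(value, tuple) and len(value) == 2:
            # (x, y) is shorthand for a range to tune over
            normalized[key] = {"range": [value[0], value[1]]}
        else:
            # Fixed values are kept as they are
            normalized[key] = value
    return normalized


space = normalize_parameter_space({"tolerance": (0.01, 0.1), "penalty": 1.0})
print(space)
```

Here `tolerance` becomes a range to search over, while `penalty` stays a fixed value.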

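There are two main differences when comparing the methods above to the procedures of the Cypher API:

As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.
Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.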
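
The two differences can be illustrated with a toy stand-in for a pipeline object. `SketchPipeline`, its method `add_node_property_args`, and the dictionary keys it returns are all made up for this sketch and are not the client's implementation. The point is only that the name comes from the object and the keyword arguments become the configuration map.

```python
from typing import Any, Dict


class SketchPipeline:
    """Hypothetical stand-in for a pipeline object; not the GDS client class."""

    def __init__(self, name: str) -> None:
        self._name = name

    def name(self) -> str:
        return self._name

    def add_node_property_args(self, procedure_name: str, **config: Any) -> Dict[str, Any]:
        # The pipeline name is taken from the object, so the caller never passes it.
        # The keyword arguments are collected into the configuration map.
        return {
            "pipelineName": self._name,
            "procedureName": procedure_name,
            "procedureConfiguration": dict(config),
        }


sketch = SketchPipeline("my-pipe")
args = sketch.add_node_property_args("degree", mutateProperty="rank")
print(args)
```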

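Another difference is that the train Python call takes a graph object instead of a graph name, and returns an NCModel model object that we can run predictions with, as well as a pandas Series with metadata from the training.

Please consult the node classification Cypher documentation for information about what kind of input the methods expect.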

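1.1.1. Example

Below is a small example of how one could configure and train a very basic node classification pipeline. Note that we don't configure the split explicitly, but rather use the default.

To exemplify this, we introduce a small person graph: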

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob", fraudster: 0}),
    (b:Person {name: "Alice", fraudster: 0}),
    (c:Person {name: "Eve", fraudster: 1}),
    (d:Person {name: "Chad", fraudster: 1}),
    (e:Person {name: "Dan", fraudster: 0}),
    (f:UnknownPerson {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", {"Person": {"properties": ["fraudster"]}}, "KNOWS")

assert G.node_labels() == ["Person"]

pipe, _ = gds.beta.pipeline.nodeClassification.create("my-pipe")

# Add Degree centrality as a property step producing "rank" node properties
pipe.addNodeProperty("degree", mutateProperty="rank")

# Select our "rank" property as a feature for the model training
pipe.selectFeatures("rank")

# Verify that the features to be used in model training are what we expect
feature_properties = pipe.feature_properties()
assert len(feature_properties) == 1
assert feature_properties[0]["feature"] == "rank"

# Configure the model training to do cross-validation over logistic regression
pipe.addLogisticRegression(tolerance=(0.01, 0.1))
pipe.addLogisticRegression(penalty=1.0)

# Train the pipeline targeting node property "fraudster" as label and "ACCURACY" as only metric
fraud_model, train_result = pipe.train(
    G,
    modelName="fraud-model",
    targetProperty="fraudster",
    metrics=["ACCURACY"],
    randomSeed=111
)
assert train_result["trainMillis"] >= 0

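A model referred to as "fraud-model" in the GDS model catalog is produced. In the next section we will go over how to use that model to make predictions.

1.2. Model

As we saw in the previous section, node classification models are created when training a node classification pipeline. In addition to inheriting the methods common to all model objects, node classification models have the following methods: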

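Table 2. Node classification model methods
Name	Arguments	Return type	Description
`classes`	`-`	`List[int]`	List of classes used to train the classification model.
`feature_properties`	`-`	`List[str]`	The node properties used as input model features.
`node_property_steps`	`-`	`List[NodePropertyStep]`	List of the algorithms the pipeline uses to compute node properties before training.
`metrics`	`-`	`Series`	Values of the metrics specified when training.
`best_parameters`	`-`	`Series`	The parameters of the training method that performed best on the validation set.
`predict_mutate`	`G: Graph, config: **kwargs`	`Series`	Predicts classes for nodes of the input graph and mutates the graph with the predictions.
`predict_mutate_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimates the resources required to predict classes for nodes of the input graph and mutate the graph with the predictions.
`predict_stream`	`G: Graph, config: **kwargs`	`DataFrame`	Predicts classes for nodes of the input graph and streams the results.
`predict_stream_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimates the resources required to predict classes for nodes of the input graph and stream the results.
`predict_write`	`G: Graph, config: **kwargs`	`Series`	Predicts classes for nodes of the input graph and writes the results back to the database.
`predict_write_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimates the resources required to predict classes for nodes of the input graph and write the results back to the database.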

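One can note that the predict methods are indeed very similar to their Cypher counterparts. The three main differences are:

They take a graph object instead of a graph name.
They have Python keyword arguments representing the keys of the configuration map.
One does not have to provide a "modelName", since the model object used itself has this information.

1.2.1. Example (continued)

We now continue the example above using the node classification model fraud_model we trained there.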

# Make sure we indeed obtained an accuracy score
metrics = fraud_model.metrics()
assert "ACCURACY" in metrics

H, project_result = gds.graph.project("full_person_graph", ["Person", "UnknownPerson"], "KNOWS")

# Predict on `H` and stream the results with a specific concurrency of 2
predictions = fraud_model.predict_stream(H, concurrency=2)
assert len(predictions) == H.node_count()
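
Once predictions have been streamed, a common next step is to tally how many nodes fall into each predicted class. The sketch below works on plain dictionaries so that it runs without a database. The column names `nodeId` and `predictedClass` and the sample rows are assumptions made for illustration. With a real result, one might first convert the DataFrame with pandas' `to_dict("records")`.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_predicted_classes(
    rows: Iterable[Dict[str, Any]], column: str = "predictedClass"
) -> Dict[Any, int]:
    """Count how many rows were assigned to each class."""
    return dict(Counter(row[column] for row in rows))


# Made-up rows standing in for streamed predictions
sample_rows = [
    {"nodeId": 0, "predictedClass": 0},
    {"nodeId": 1, "predictedClass": 0},
    {"nodeId": 2, "predictedClass": 1},
]
class_counts = count_predicted_classes(sample_rows)
print(class_counts)
```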

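2. Link prediction

This section outlines how to use the Python client to build, configure and train a link prediction pipeline, as well as how to use the model that training produces for predictions.

2.1. Pipeline

To create a new link prediction pipeline, one would make the following call: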

pipe = gds.lp_pipe("my-pipe")

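where pipe is a pipeline object.

To then build, configure and train the pipeline, we call methods directly on the link prediction pipeline object. The methods of such objects are described below: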

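Table 3. Link prediction pipeline methods
Name	Arguments	Return type	Description
`addNodeProperty`	`procedure_name: str, config: **kwargs`	`Series`	Adds an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration.
`addFeature`	`feature_type: str, config: **kwargs`	`Series`	Adds a link feature for model training based on node properties and a feature combiner.
`configureSplit`	`config: **kwargs`	`Series`	Configures the feature-train-test dataset split.
`addLogisticRegression`	`parameter_space: dict[str, any]`	`Series`	Adds a logistic regression model configuration to train as a candidate in the model selection phase. ^[2]
`addRandomForest`	`parameter_space: dict[str, any]`	`Series`	Adds a random forest model configuration to train as a candidate in the model selection phase. ^[2]
`addMLP`	`parameter_space: dict[str, any]`	`Series`	Adds an MLP model configuration to train as a candidate in the model selection phase. ^[2]
`configureAutoTuning`	`config: **kwargs`	`Series`	Configures the auto-tuning.
`train`	`G: Graph, config: **kwargs`	`LPModel, Series`	Trains the pipeline on the given input graph using the given keyword arguments.
`train_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimates the resources required to train the pipeline on the given input graph.
`feature_steps`	`-`	`DataFrame`	Returns a list of the selected feature steps for the pipeline.
`exists`	`-`	`bool`	`True` if the pipeline exists in the GDS pipeline catalog, `False` otherwise.
`name`	`-`	`str`	The name of the pipeline as it appears in the pipeline catalog.
`type`	`-`	`str`	The type of pipeline.
`creation_time`	`-`	`neo4j.time.Datetime`	The time the pipeline was created.
`node_property_steps`	`-`	`DataFrame`	Returns the node property steps of the pipeline.
`split_config`	`-`	`Series`	Returns the configuration set up for feature-train-test splitting of the dataset.
`parameter_space`	`-`	`Series`	Returns the model parameter space set up for model selection when training.
`auto_tuning_config`	`-`	`Series`	Returns the configuration set up for auto-tuning.
`drop`	`failIfMissing: Optional[bool]`	`Series`	Removes the pipeline from the GDS pipeline catalog.
2. Ranges can also be given as length-two `Tuple`s. For example, `(x, y)` is the same as `{range: [x, y]}`.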

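There are two main differences when comparing the methods above to the procedures of the Cypher API:

As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.
Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.

Another difference is that the train Python call takes a graph object instead of a graph name, and returns an LPModel model object that we can run predictions with, as well as a pandas Series with metadata from the training.

Please consult the link prediction Cypher documentation for information about what kind of input the methods expect.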

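2.1.1. Example

Below is a small example of how one could configure and train a very basic link prediction pipeline. Note that we don't configure every training parameter explicitly, but rely on the defaults for anything not set.

To exemplify this, we introduce a small person graph: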

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob"}),
    (b:Person {name: "Alice"}),
    (c:Person {name: "Eve"}),
    (d:Person {name: "Chad"}),
    (e:Person {name: "Dan"}),
    (f:Person {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", "Person", {"KNOWS": {"orientation":"UNDIRECTED"}})

assert G.relationship_types() == ["KNOWS"]

pipe, _ = gds.beta.pipeline.linkPrediction.create("lp-pipe")

# Add FastRP as a property step producing "embedding" node properties
pipe.addNodeProperty("fastRP", embeddingDimension=128, mutateProperty="embedding", randomSeed=1337)

# Combine our "embedding" node properties with Hadamard to create link features for training
pipe.addFeature("hadamard", nodeProperties=["embedding"])

# Verify that the features to be used in model training are what we expect
steps = pipe.feature_steps()
assert len(steps) == 1
assert steps["name"][0] == "HADAMARD"

# Specify the fractions we want for our dataset split
pipe.configureSplit(trainFraction=0.2, testFraction=0.2, validationFolds=2)

# Add a random forest model with tuning over `maxDepth`
pipe.addRandomForest(maxDepth=(2, 20))

# Train the pipeline and produce a model named "friend-recommender"
friend_recommender, train_result = pipe.train(
    G,
    modelName="friend-recommender",
    targetRelationshipType="KNOWS",
    randomSeed=42
)
assert train_result["trainMillis"] >= 0

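A model referred to as "friend-recommender" in the GDS model catalog is produced. In the next section we will go over how to use that model to make predictions.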
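
The `hadamard` feature step used in the example combines the properties of the two endpoint nodes elementwise. As a minimal sketch of that combination in plain Python (the vectors are made up and far shorter than the 128-dimensional embeddings configured in the example):

```python
from typing import List, Sequence


def hadamard(source: Sequence[float], target: Sequence[float]) -> List[float]:
    """Elementwise product of two equally long vectors."""
    if len(source) != len(target):
        raise ValueError("vectors must have the same length")
    return [s * t for s, t in zip(source, target)]


# Made-up three-dimensional embeddings for a pair of nodes
link_feature = hadamard([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(link_feature)
```

The result has the same dimension as the input embeddings, so each node pair is described by one vector of that length.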

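2.2. Model

As we saw in the previous section, link prediction models are created when training a link prediction pipeline. In addition to inheriting the methods common to all model objects, link prediction models have the following methods: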

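Table 4. Link prediction model methods
Name	Arguments	Return type	Description
`link_features`	`-`	`List[LinkFeature]`	The input link features used to train the model.
`node_property_steps`	`-`	`List[NodePropertyStep]`	List of the algorithms the pipeline uses to compute node properties before training.
`metrics`	`-`	`Series`	Values of the metrics specified when training.
`best_parameters`	`-`	`Series`	The parameters of the training method that performed best on the validation set.
`predict_mutate`	`G: Graph, config: **kwargs`	`Series`	Predicts links between non-neighboring nodes of the input graph and mutates the graph with the predictions.
`predict_mutate_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimates the resources required to predict links between non-neighboring nodes of the input graph and mutate the graph with the predictions.
`predict_stream`	`G: Graph, config: **kwargs`	`DataFrame`	Predicts links between non-neighboring nodes of the input graph and streams the results.
`predict_stream_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimates the resources required to predict links between non-neighboring nodes of the input graph and stream the results.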

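One can note that the predict methods are indeed very similar to their Cypher counterparts. The three main differences are:

They take a graph object instead of a graph name.
They have Python keyword arguments representing the keys of the configuration map.
One does not have to provide a "modelName", since the model object used itself has this information.

2.2.1. Example (continued)

We now continue the example above using the link prediction model friend_recommender we trained there.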

# Make sure we indeed obtained an AUCPR score
metrics = friend_recommender.metrics()
assert "AUCPR" in metrics

# Predict on `G` and mutate it with the relationship predictions
mutate_result = friend_recommender.predict_mutate(G, topN=5, mutateRelationshipType="PRED_REL")
assert mutate_result["relationshipsWritten"] == 5 * 2  # Undirected relationships

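3. Node regression

This section outlines how to use the Python client to build, configure and train a node regression pipeline, as well as how to use the model that training produces for predictions.

3.1. Pipeline

To create a new node regression pipeline, one would make the following call: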

pipe = gds.nr_pipe("my-pipe")

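where pipe is a pipeline object.

To then build, configure and train the pipeline, we call methods directly on the node regression pipeline object. The methods of such objects are described below: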

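Table 5. Node regression pipeline methods
Name	Arguments	Return type	Description
`addNodeProperty`	`procedure_name: str, config: **kwargs`	`Series`	Adds an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration.
`selectFeatures`	`node_properties: Union[str, list[str]]`	`Series`	Selects the node properties to use as features.
`configureSplit`	`config: **kwargs`	`Series`	Configures the train-test dataset split.
`addLinearRegression`	`parameter_space: dict[str, any]`	`Series`	Adds a linear regression model configuration to train as a candidate in the model selection phase. ^[3]
`addRandomForest`	`parameter_space: dict[str, any]`	`Series`	Adds a random forest model configuration to train as a candidate in the model selection phase. ^[3]
`configureAutoTuning`	`config: **kwargs`	`Series`	Configures the auto-tuning.
`train`	`G: Graph, config: **kwargs`	`NRModel, Series`	Trains the pipeline on the given input graph using the given keyword arguments.
`feature_properties`	`-`	`Series`	Returns a list of the selected feature properties for the pipeline.
`exists`	`-`	`bool`	`True` if the pipeline exists in the GDS pipeline catalog, `False` otherwise.
`name`	`-`	`str`	The name of the pipeline as it appears in the pipeline catalog.
`type`	`-`	`str`	The type of pipeline.
`creation_time`	`-`	`neo4j.time.Datetime`	The time the pipeline was created.
`node_property_steps`	`-`	`DataFrame`	Returns the node property steps of the pipeline.
`split_config`	`-`	`Series`	Returns the configuration set up for feature-train-test splitting of the dataset.
`parameter_space`	`-`	`Series`	Returns the model parameter space set up for model selection when training.
`auto_tuning_config`	`-`	`Series`	Returns the configuration set up for auto-tuning.
`drop`	`failIfMissing: Optional[bool]`	`Series`	Removes the pipeline from the GDS pipeline catalog.
3. Ranges can also be given as length-two `Tuple`s. For example, `(x, y)` is the same as `{range: [x, y]}`.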

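There are two main differences when comparing the methods above to the procedures of the Cypher API:

As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.
Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.

Another difference is that the train Python call takes a graph object instead of a graph name, and returns an NRModel model object that we can run predictions with, as well as a pandas Series with metadata from the training.

Please consult the node regression Cypher documentation for information about what kind of input the methods expect.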

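3.1.1. Example

Below is a small example of how one could configure and train a very basic node regression pipeline. Note that we don't configure the split explicitly, but rather use the default.

To exemplify this, we introduce a small person graph: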

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob", age: 22}),
    (b:Person {name: "Alice", age: 5}),
    (c:Person {name: "Eve", age: 53}),
    (d:Person {name: "Chad", age: 44}),
    (e:Person {name: "Dan", age: 60}),
    (f:UnknownPerson {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", {"Person": {"properties": ["age"]}}, "KNOWS")

assert G.relationship_types() == ["KNOWS"]

pipe, _ = gds.alpha.pipeline.nodeRegression.create("nr-pipe")

# Add Degree centrality as a property step producing "rank" node properties
pipe.addNodeProperty("degree", mutateProperty="rank")

# Select our "rank" property as a feature for the model training
pipe.selectFeatures("rank")

# Verify that the features to be used in model training are what we expect
feature_properties = pipe.feature_properties()
assert len(feature_properties) == 1
assert feature_properties[0]["feature"] == "rank"

# Configure the model training to do cross-validation over linear regression
pipe.addLinearRegression(tolerance=(0.01, 0.1))
pipe.addLinearRegression(penalty=1.0)

# Train the pipeline targeting node property "age" as label and "MEAN_SQUARED_ERROR" as only metric
age_predictor, train_result = pipe.train(
    G,
    modelName="age-predictor",
    targetProperty="age",
    metrics=["MEAN_SQUARED_ERROR"],
    randomSeed=42
)
assert train_result["trainMillis"] >= 0

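A model referred to as "age-predictor" in the GDS model catalog is produced. In the next section we will go over how to use that model to make predictions.

3.2. Model

As we saw in the previous section, node regression models are created when training a node regression pipeline. In addition to inheriting the methods common to all model objects, node regression models have the following methods: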

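Table 6. Node regression model methods
Name	Arguments	Return type	Description
`feature_properties`	`-`	`List[str]`	Returns the node properties used as input model features.
`node_property_steps`	`-`	`List[NodePropertyStep]`	List of the algorithms the pipeline uses to compute node properties before training.
`metrics`	`-`	`Series`	Values of the metrics specified when training.
`best_parameters`	`-`	`Series`	The parameters of the training method that performed best on the validation set.
`predict_mutate`	`G: Graph, config: **kwargs`	`Series`	Predicts property values for nodes of the input graph and mutates the graph with the predictions.
`predict_stream`	`G: Graph, config: **kwargs`	`DataFrame`	Predicts property values for nodes of the input graph and streams the results.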

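One can note that the predict methods are indeed very similar to their Cypher counterparts. The three main differences are:

They take a graph object instead of a graph name.
They have Python keyword arguments representing the keys of the configuration map.
One does not have to provide a "modelName", since the model object used itself has this information.

3.2.1. Example (continued)

We now continue the example above using the node regression model age_predictor we trained there. The new graph H that we want to run predictions on is projected as part of the example.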

# Make sure we indeed obtained a MEAN_SQUARED_ERROR score
metrics = age_predictor.metrics()
assert "MEAN_SQUARED_ERROR" in metrics

H, project_result = gds.graph.project("full_person_graph", ["Person", "UnknownPerson"], "KNOWS")

# Predict on `H` and stream the results with a specific concurrency of 2
predictions = age_predictor.predict_stream(H, concurrency=2)
assert len(predictions) == H.node_count()
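
To make the metric used in this example concrete, the following sketch computes a mean squared error by hand. The actual ages are taken from the example graph. The predicted values are invented for illustration and are not output of the trained model.

```python
from typing import Sequence


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Ages of Bob, Alice, Eve, Chad and Dan from the example graph
actual_ages = [22, 5, 53, 44, 60]
# Invented predictions, only to show the arithmetic
hypothetical_predictions = [20, 10, 50, 44, 60]

mse = mean_squared_error(actual_ages, hypothetical_predictions)
print(mse)
```

The squared errors here are 4, 25, 9, 0 and 0, so the mean is 38 / 5 = 7.6. A lower value means the predictions are closer to the known ages.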

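4. The pipeline catalog

The primary way to use pipeline objects is for training models. Additionally, pipeline objects can be used as input to GDS pipeline catalog operations. For instance, supposing we have a pipeline object pipe, we could: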

exists_result = gds.pipeline.exists(pipe.name())

if exists_result["exists"]:
    gds.pipeline.drop(pipe)  # same as pipe.drop()
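
The check-then-drop pattern can be wrapped in a small function for reuse. This is a sketch, not part of the client. The name `drop_pipeline_if_exists` is made up, and the function uses only the `gds.pipeline.exists`, `gds.pipeline.drop` and `pipe.name()` calls shown in the snippet it follows.

```python
from typing import Any


def drop_pipeline_if_exists(gds: Any, pipe: Any) -> bool:
    """Drop `pipe` from the pipeline catalog if present; return whether it was dropped."""
    exists_result = gds.pipeline.exists(pipe.name())
    if exists_result["exists"]:
        gds.pipeline.drop(pipe)
        return True
    return False
```

Calling `drop_pipeline_if_exists(gds, pipe)` before re-creating a pipeline with the same name is one way to keep repeated runs of a notebook from failing on a name clash.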

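A pipeline object for a pipeline that has already been created in the pipeline catalog can be retrieved by calling the get method with its name. For example, we can list all pipelines in the catalog and use the name of the first one found to get a pipeline object for it, which will be the NodeClassification pipeline we created in the example above.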

list_result = gds.pipeline.list()
first_pipeline_name = list_result["pipelineName"][0]
pipe = gds.pipeline.get(first_pipeline_name)
assert pipe.name() == "my-pipe"