PyG integration: Sample and export

This Jupyter notebook is hosted here in the Neo4j Graph Data Science Client GitHub repository.

For a video presentation of this notebook, see the talk GNNs at Scale With Graph Data Science Sampling and Python Client Integration, given at the NODES 2022 conference.

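This notebook demonstrates how to use graphdatascience together with the PyTorch Geometric (PyG) Python library to:

Import the CORA dataset directly into GDS
Sample part of CORA using the GDS Random Walk with Restarts algorithm
Export the CORA sample client side
Define and train a Graph Convolutional Neural Network (GCN) on the CORA sample
Evaluate the GCN on a test set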

1. Prerequisites

Running this notebook requires a Neo4j server with a recent version (2.5+) of GDS installed. We recommend using Neo4j Desktop with GDS, or AuraDS.

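The following Python libraries are also required:

graphdatascience (see the documentation for installation instructions)
PyG (see the PyG documentation for installation instructions)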

2. Setup

We start by importing our dependencies and setting up the GDS client connection to the database.

Alternatively, you can use Aura Graph Analytics and skip the whole Setup section below.

# Install necessary dependencies
%pip install graphdatascience torch torch_scatter torch_sparse torch_geometric

import os
import random

import numpy as np
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
from torch_geometric.transforms import RandomNodeSplit

from graphdatascience import GraphDataScience

# Set seeds for consistent results
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

# Get Neo4j DB URI, credentials and name from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
NEO4J_DB = os.environ.get("NEO4J_DB", "neo4j")
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH, database=NEO4J_DB)

from graphdatascience import ServerVersion

assert gds.server_version() >= ServerVersion(2, 5, 0)

3. Sampling CORA

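Next we use the built-in CORA loader to get the data into GDS, and then sample it to obtain a smaller graph to train on. In a real-world scenario you would typically project data from a Neo4j database into GDS instead of using the built-in loader.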

G = gds.graph.load_cora()

# Let's make sure we constructed the correct graph
print(f"Metadata for our loaded Cora graph `G`: {G}")
print(f"Node labels present in `G`: {G.node_labels()}")

That looks correct. Now let's go ahead and sample the graph.

# We use the random walk with restarts sampling algorithm with default values
G_sample, _ = gds.graph.sample.rwr("cora_sample", G, randomSeed=42, concurrency=1)

# We should have somewhere around 0.15 * 2708 ~ 406 nodes in our sample
print(f"Number of nodes in our sample: {G_sample.node_count()}")

# And let's see how many relationships we got
print(f"Number of relationships in our sample: {G_sample.relationship_count()}")
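To give some intuition for what random walk with restarts sampling does, here is a minimal pure-Python sketch on a toy graph. It is an illustration only and not the GDS implementation: the function name `rwr_sample`, the adjacency-dict graph format, and the parameter names and values are all chosen for this example.

```python
import random


def rwr_sample(adjacency, start_node, sampling_ratio=0.15, restart_probability=0.1, seed=42):
    """Collect nodes by walking randomly and occasionally jumping back to the start."""
    rng = random.Random(seed)
    target_size = max(1, round(sampling_ratio * len(adjacency)))
    visited = {start_node}
    current = start_node
    # Cap the number of steps so the loop ends even if the walk cannot reach enough nodes
    for _ in range(100 * len(adjacency)):
        if len(visited) >= target_size:
            break
        neighbors = adjacency[current]
        if not neighbors or rng.random() < restart_probability:
            # Restart: jump back to the start node
            current = start_node
        else:
            # Otherwise step to a random neighbor and remember it
            current = rng.choice(neighbors)
            visited.add(current)
    return visited


# Toy ring graph with 20 nodes; each node links to its two neighbors
toy_graph = {i: [(i - 1) % 20, (i + 1) % 20] for i in range(20)}
sample = rwr_sample(toy_graph, start_node=0, sampling_ratio=0.25)
print(sorted(sample))
```

Because the walk keeps returning to its start node, the sampled nodes tend to cluster around it, which is why a sample of this kind preserves local neighborhood structure rather than picking nodes uniformly at random.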

4. Exporting the sampled CORA

We can now export the topology and node properties of the sampled graph, which we need to train our model.

# Get the relationship data from our sample
sample_topology_df = gds.graph.relationships.stream(G_sample)

# Let's see what we got:
display(sample_topology_df)

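We got the right number of rows, one for each expected relationship. The node IDs are large, however, whereas PyG expects consecutive IDs starting from zero. Next we will reshape the relationship data into a form that PyG can use.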

# By using `by_rel_type` we get the topology in a format that can be used as input to several GNN frameworks:
# {"rel_type": [[source_nodes], [target_nodes]]}

sample_topology = sample_topology_df.by_rel_type()

# We should only have the "CITES" key since there's only one relationship type
print(f"Relationship type keys: {sample_topology.keys()}")
# The value under "CITES" holds two lists: source nodes and target nodes
print(f"Number of lists under 'CITES': {len(sample_topology['CITES'])}")

# How many source nodes do we have?
print(len(sample_topology["CITES"][0]))

Great, it looks like we have the format we need to create a PyG edge_index.

# We also need to export the node properties corresponding to our node labels and features, represented by the
# "subject" and "features" node properties respectively
sample_node_properties = gds.graph.nodeProperties.stream(
    G_sample,
    ["subject", "features"],
    separate_property_columns=True,
)

# Let's make sure we got the data we expected
display(sample_node_properties)

5. Constructing GCN input

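Now that all the information we need is available client side, we can construct the PyG Data object used as training input. We remap the node IDs to be consecutive and start from zero, using the order of node IDs in sample_node_properties as the mapping, so that the indices line up with the node properties.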
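As a toy illustration of this remapping, the sketch below uses made-up node IDs and a three-relationship topology. The variable names are hypothetical and only serve this example.

```python
# Made-up database node ids, in the order they appear in the node property table
node_ids = [1042, 17, 993, 250]

# Toy topology in the [[source_nodes], [target_nodes]] layout
topology = [[1042, 17, 993], [17, 993, 250]]

# The row position in the property table becomes the new, zero-based id
old_to_new = {old_id: new_id for new_id, old_id in enumerate(node_ids)}

remapped = [[old_to_new[node_id] for node_id in nodes] for nodes in topology]
print(remapped)  # [[0, 1, 2], [1, 2, 3]]
```

After remapping, index `i` in the topology refers to row `i` of the property table, so features and labels can be looked up by position.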

# In order for the node ids used in the `topology` to be consecutive and starting from zero,
# we will need to remap them. This way they will also align with the row numbering of the
# `sample_node_properties` data frame
def normalize_topology_index(new_idx_to_old, topology):
    # Create a reverse mapping based on new idx -> old idx
    old_idx_to_new = dict((v, k) for k, v in new_idx_to_old.items())
    return [[old_idx_to_new[node_id] for node_id in nodes] for nodes in topology]


# We use the ordering of node ids in `sample_node_properties` as our remapping
# The result is: [[mapped_source_nodes], [mapped_target_nodes]]
normalized_topology = normalize_topology_index(dict(sample_node_properties["nodeId"]), sample_topology["CITES"])

# Convert the remapped topology into the `edge_index` tensor that PyG expects
edge_index = torch.tensor(normalized_topology, dtype=torch.long)

# We specify the node property "features" as the zero-layer node embeddings
x = torch.tensor(sample_node_properties["features"], dtype=torch.float)

# We specify the node property "subject" as class labels
y = torch.tensor(sample_node_properties["subject"], dtype=torch.long)

data = Data(x=x, y=y, edge_index=edge_index)

print(data)

# Do a random split of the data so that ~10% goes into a test set and the rest is used for training
transform = RandomNodeSplit(num_test=40, num_val=0)
data = transform(data)

# We can see that our `data` object has been extended with some masks defining the split
print(data)
print(data.train_mask.sum().item())

As a side note, holding out part of the data as a validation set would be useful if we wanted to do hyperparameter tuning.
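With `RandomNodeSplit`, holding out a validation set amounts to passing a non-zero `num_val`, which marks that many nodes in `data.val_mask`. The plain-Python sketch below shows what such a three-way split produces. The helper name `split_masks` and the sizes are illustrative assumptions, and it does not reproduce PyG's exact behavior.

```python
import random


def split_masks(num_nodes, num_val, num_test, seed=42):
    """Return boolean train/val/test masks that partition the nodes at random."""
    order = list(range(num_nodes))
    random.Random(seed).shuffle(order)
    val_nodes = set(order[:num_val])
    test_nodes = set(order[num_val:num_val + num_test])
    val_mask = [i in val_nodes for i in range(num_nodes)]
    test_mask = [i in test_nodes for i in range(num_nodes)]
    # Everything not held out for validation or testing is used for training
    train_mask = [not (v or t) for v, t in zip(val_mask, test_mask)]
    return train_mask, val_mask, test_mask


train_mask, val_mask, test_mask = split_masks(num_nodes=400, num_val=40, num_test=40)
print(sum(train_mask), sum(val_mask), sum(test_mask))  # 320 40 40
```

During tuning, the validation mask would be used to compare hyperparameter settings, keeping the test mask untouched for the final evaluation.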

6. Training and evaluating a GCN

Next we define and train the GCN using PyG, with the sampled CORA as input. We follow the CORA GCN example in the PyG documentation.

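In this example we evaluate the model on a test set of the sampled CORA. Note that since GCN is an inductive algorithm, we could also evaluate it on the full CORA dataset, or even on another (similar) graph.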

num_classes = y.unique().shape[0]


# Define the GCN architecture
class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(data.num_node_features, 16)
        self.conv2 = GCNConv(16, num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        # We use log_softmax and nll_loss instead of softmax output and cross entropy loss
        # for reasons of performance and numerical stability.
        # The two formulations are mathematically equivalent
        return F.log_softmax(x, dim=1)
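The comment in `forward` states that `log_softmax` followed by `nll_loss` is mathematically equivalent to a softmax output with cross entropy loss. A small check in plain Python, using made-up logits for a single sample, illustrates this. It is a sketch of the arithmetic only and does not call PyTorch.

```python
import math

logits = [2.0, -1.0, 0.5]  # made-up scores for three classes
target = 0  # index of the true class

# Route 1: log_softmax followed by negative log-likelihood
log_norm = math.log(sum(math.exp(z) for z in logits))
log_probs = [z - log_norm for z in logits]
nll = -log_probs[target]

# Route 2: softmax followed by cross entropy against a one-hot target
exp_sum = sum(math.exp(z) for z in logits)
probs = [math.exp(z) / exp_sum for z in logits]
cross_entropy = -math.log(probs[target])

print(math.isclose(nll, cross_entropy))  # True
```

Route 1 never takes the logarithm of a probability that may have underflowed to zero, which is where the numerical stability benefit comes from.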

# Prepare training by setting up for the chosen device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Let's see what device was chosen
print(device)

# In standard PyTorch fashion we instantiate our model, and transfer it to the memory of the chosen device
model = GCN().to(device)

# Let's inspect our model architecture
print(model)

# Pass our input data to the chosen device too
data = data.to(device)

# Since hyperparameter tuning is out of scope for this small example, we initialize an
# Adam optimizer with some fixed learning rate and weight decay
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

From the model structure we can see that the output dimension is 7, which is as expected since CORA has exactly 7 different paper subjects.

# Train the GCN using the CORA sample represented by `data` using the standard PyTorch training loop
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# Evaluate the trained GCN model on our test set
model.eval()
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())

print(f"Accuracy: {acc:.4f}")

The accuracy looks good. A natural next step would be to run the GCN trained on our sample on the full CORA graph. This is left as an exercise.

7. Cleanup

We remove the CORA graphs from the GDS graph catalog.

_ = G_sample.drop()
_ = G.drop()