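Product recommendations with kNN based on FastRP embeddings

This Jupyter notebook is hosted in the Neo4j Graph Data Science Client GitHub repository.

This notebook shows how to use the graphdatascience Python library to operate Neo4j GDS. It walks through an adapted version of the FastRP and kNN end-to-end example from the GDS Manual.

We consider a graph of products and customers, and we want to find new products to recommend to each customer. To do so, we use the K-Nearest Neighbors algorithm (kNN) to identify similar customers and base our product recommendations on that similarity. So that kNN can take the topology of the graph into account, we first create node embeddings with FastRP. These embeddings are the input to the kNN algorithm.

We then use a Cypher query to generate recommendations for each pair of similar customers, where products purchased by one customer are recommended to the other.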

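1. Prerequisites

Running this notebook requires a Neo4j server with a recent version (2.0+) of GDS installed. We recommend using Neo4j Desktop with GDS, or AuraDS.

The graphdatascience Python library also needs to be installed. See the example in the Setup section below and the client installation instructions.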

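2. Setup

We start by installing and importing our dependencies, and setting up our GDS client connection to the database.

Alternatively, you can use Aura Graph Analytics and skip the whole Setup section below.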

# Install necessary dependencies
%pip install graphdatascience

import os

from graphdatascience import GraphDataScience

# Get Neo4j DB URI and credentials from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )

gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)

from graphdatascience import ServerVersion

# The calls used below (e.g. `gds.graph.project`) require GDS 2.0 or later
assert gds.server_version() >= ServerVersion(2, 0, 0)

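3. Example graph creation

We now create a graph of products and customers in the database. The amount relationship property represents the average weekly amount of money spent by a customer on a given product.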

# The `run_cypher` method can be used to run arbitrary Cypher queries on the database.
_ = gds.run_cypher(
    """
        CREATE
         (dan:Person {name: 'Dan'}),
         (annie:Person {name: 'Annie'}),
         (matt:Person {name: 'Matt'}),
         (jeff:Person {name: 'Jeff'}),
         (brie:Person {name: 'Brie'}),
         (elsa:Person {name: 'Elsa'}),

         (cookies:Product {name: 'Cookies'}),
         (tomatoes:Product {name: 'Tomatoes'}),
         (cucumber:Product {name: 'Cucumber'}),
         (celery:Product {name: 'Celery'}),
         (kale:Product {name: 'Kale'}),
         (milk:Product {name: 'Milk'}),
         (chocolate:Product {name: 'Chocolate'}),

         (dan)-[:BUYS {amount: 1.2}]->(cookies),
         (dan)-[:BUYS {amount: 3.2}]->(milk),
         (dan)-[:BUYS {amount: 2.2}]->(chocolate),

         (annie)-[:BUYS {amount: 1.2}]->(cucumber),
         (annie)-[:BUYS {amount: 3.2}]->(milk),
         (annie)-[:BUYS {amount: 3.2}]->(tomatoes),

         (matt)-[:BUYS {amount: 3}]->(tomatoes),
         (matt)-[:BUYS {amount: 2}]->(kale),
         (matt)-[:BUYS {amount: 1}]->(cucumber),

         (jeff)-[:BUYS {amount: 3}]->(cookies),
         (jeff)-[:BUYS {amount: 2}]->(milk),

         (brie)-[:BUYS {amount: 1}]->(tomatoes),
         (brie)-[:BUYS {amount: 2}]->(milk),
         (brie)-[:BUYS {amount: 2}]->(kale),
         (brie)-[:BUYS {amount: 3}]->(cucumber),
         (brie)-[:BUYS {amount: 0.3}]->(celery),

         (elsa)-[:BUYS {amount: 3}]->(chocolate),
         (elsa)-[:BUYS {amount: 3}]->(milk)
    """
)
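As a quick sanity check on the data above, the same purchases can be mirrored in plain Python and counted. This is only a sketch: `PURCHASES` is a hypothetical name introduced here, and nothing in it talks to the database.

```python
# Plain-Python mirror of the BUYS relationships created by the Cypher above.
# `PURCHASES` is a hypothetical name used only for this sanity check.
PURCHASES = {
    "Dan": {"Cookies": 1.2, "Milk": 3.2, "Chocolate": 2.2},
    "Annie": {"Cucumber": 1.2, "Milk": 3.2, "Tomatoes": 3.2},
    "Matt": {"Tomatoes": 3, "Kale": 2, "Cucumber": 1},
    "Jeff": {"Cookies": 3, "Milk": 2},
    "Brie": {"Tomatoes": 1, "Milk": 2, "Kale": 2, "Cucumber": 3, "Celery": 0.3},
    "Elsa": {"Chocolate": 3, "Milk": 3},
}

people = set(PURCHASES)
products = {name for basket in PURCHASES.values() for name in basket}
relationship_count = sum(len(basket) for basket in PURCHASES.values())

print(f"Person nodes: {len(people)}")               # 6
print(f"Product nodes: {len(products)}")            # 7
print(f"BUYS relationships: {relationship_count}")  # 18
```

Assuming the database was empty before the CREATE statement ran, it should therefore hold 13 nodes in total.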

4. Projecting into GDS

In order to be able to analyze the data in our database, we project it into memory where GDS can operate on it.

# We define how we want to project our database into GDS
node_projection = ["Person", "Product"]
relationship_projection = {"BUYS": {"orientation": "UNDIRECTED", "properties": "amount"}}

# Before actually going through with the projection, let's check how much memory is required
result = gds.graph.project.estimate(node_projection, relationship_projection)

print(f"Required memory for native loading: {result['requiredMemory']}")

# For this small graph memory requirement is low. Let us go through with the projection
G, result = gds.graph.project("purchases", node_projection, relationship_projection)

print(f"The projection took {result['projectMillis']} ms")

# We can use convenience methods on `G` to check if the projection looks correct
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")
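One way to read the UNDIRECTED orientation in the projection above is that it lets information flow from a person to a product and back to another person. This is an interpretation rather than something stated here. The sketch below uses a subset of the example purchases and two hypothetical helpers, `adjacency` and `two_hop`, to show the effect in plain Python.

```python
from collections import defaultdict

# A few of the BUYS relationships from the example, as (person, product) pairs.
BUYS = [
    ("Annie", "Cucumber"), ("Annie", "Milk"), ("Annie", "Tomatoes"),
    ("Matt", "Tomatoes"), ("Matt", "Kale"), ("Matt", "Cucumber"),
]


def adjacency(edges, undirected):
    """Build neighbor sets; add the reverse direction only when `undirected`."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        if undirected:
            neighbors[target].add(source)
    return neighbors


def two_hop(neighbors, start):
    """Nodes reachable from `start` by a two-step walk, excluding `start`."""
    reached = set()
    for middle in neighbors.get(start, ()):
        reached |= neighbors.get(middle, set())
    reached.discard(start)
    return reached


directed_reach = two_hop(adjacency(BUYS, undirected=False), "Annie")
undirected_reach = two_hop(adjacency(BUYS, undirected=True), "Annie")
print(directed_reach)    # set()
print(undirected_reach)  # {'Matt'}
```

With directed relationships only, Annie never reaches Matt. With undirected ones, she reaches him through the products they share.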

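5. Creating FastRP node embeddings

Next we use the FastRP algorithm to generate node embeddings that capture topological information from the graph. We set embeddingDimension to 4, which is sufficient because our example graph is very small. The iterationWeights were chosen empirically to yield sensible results. See the syntax section of the FastRP documentation for more information on these parameters.

Since we want to use the embeddings as input when we run kNN later, we use FastRP's mutate mode.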
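To give a rough intuition for what embeddingDimension and iterationWeights control, here is a heavily simplified, FastRP-style sketch in plain Python. It is not the GDS implementation: the random initialization, the missing degree normalization, and the helper names `l2_normalize` and `toy_fastrp` are all assumptions made for illustration. It uses only a subset of the example purchases.

```python
import math
import random


def l2_normalize(vector):
    """Scale a vector to unit length; leave all-zero vectors untouched."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def toy_fastrp(neighbors, dimension, iteration_weights, seed):
    """Simplified FastRP-style embedding (illustrative only).

    neighbors: dict mapping node -> {neighbor: relationship weight}
    """
    rng = random.Random(seed)
    nodes = sorted(neighbors)
    # Start from a random vector per node with entries in {-1, 0, +1}.
    current = {n: [rng.choice((-1.0, 0.0, 1.0)) for _ in range(dimension)] for n in nodes}
    embedding = {n: [0.0] * dimension for n in nodes}
    for weight in iteration_weights:
        # Each iteration replaces a node's vector with the weighted average
        # of its neighbors' vectors from the previous iteration.
        averaged = {}
        for n in nodes:
            total = sum(neighbors[n].values())
            acc = [0.0] * dimension
            for m, w in neighbors[n].items():
                for i in range(dimension):
                    acc[i] += w * current[m][i]
            averaged[n] = [x / total for x in acc] if total > 0 else acc
        current = averaged
        # Add the normalized intermediate vector, scaled by this iteration's weight.
        for n in nodes:
            normalized = l2_normalize(current[n])
            embedding[n] = [e + weight * x for e, x in zip(embedding[n], normalized)]
    return embedding


# Undirected, weighted neighbor map built from a subset of the example purchases.
BUYS = [
    ("Annie", "Cucumber", 1.2), ("Annie", "Milk", 3.2), ("Annie", "Tomatoes", 3.2),
    ("Matt", "Tomatoes", 3), ("Matt", "Kale", 2), ("Matt", "Cucumber", 1),
]
neighbors = {}
for person, product, amount in BUYS:
    neighbors.setdefault(person, {})[product] = amount
    neighbors.setdefault(product, {})[person] = amount

embeddings = toy_fastrp(neighbors, dimension=4, iteration_weights=[0.8, 1, 1, 1], seed=42)
for node, vector in embeddings.items():
    print(node, [round(x, 3) for x in vector])
```

Each entry in iteration_weights scales how much the vectors gathered at that hop distance contribute to the final embedding, so a list of four weights means four rounds of neighbor averaging in this sketch.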

# We can also estimate memory of running algorithms like FastRP, so let's do that first
result = gds.fastRP.mutate.estimate(
    G,
    mutateProperty="embedding",
    randomSeed=42,
    embeddingDimension=4,
    relationshipWeightProperty="amount",
    iterationWeights=[0.8, 1, 1, 1],
)

print(f"Required memory for running FastRP: {result['requiredMemory']}")

# Now let's run FastRP and mutate our projected graph 'purchases' with the results
result = gds.fastRP.mutate(
    G,
    mutateProperty="embedding",
    randomSeed=42,
    embeddingDimension=4,
    relationshipWeightProperty="amount",
    iterationWeights=[0.8, 1, 1, 1],
)

# Let's make sure we got an embedding for each node
print(f"Number of embedding vectors produced: {result['nodePropertiesWritten']}")

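6. Similarities with kNN

Now we can run kNN to identify similar nodes by using the node embeddings that we generated with FastRP as nodeProperties. Since we are working with a small graph, we can set sampleRate to 1 and deltaThreshold to 0 without having to worry about long computation times. The concurrency parameter is set to 1 (along with the fixed randomSeed) in order to get a deterministic result. See the syntax section of the kNN documentation for more information on these parameters.

Note that we use the algorithm's write mode to write the properties and relationships back to our database, so that we can analyze them later using Cypher.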
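Conceptually, kNN links each node to the topK other nodes whose property vectors are most similar to its own. The sketch below does this by brute force with plain cosine similarity on made-up vectors. GDS's kNN is an approximate algorithm, and its exact metric and score scaling may differ, so treat this as an illustration of the idea only. The helper names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_neighbors(embeddings, k):
    """For each node, return its k most similar other nodes as (name, score) pairs."""
    result = {}
    for node, vector in embeddings.items():
        scored = [
            (other, cosine(vector, other_vector))
            for other, other_vector in embeddings.items()
            if other != node
        ]
        # Sort by descending score, breaking ties by name for determinism.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        result[node] = scored[:k]
    return result


# Made-up 2-dimensional embeddings, purely for illustration.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
similar = top_k_neighbors(toy_embeddings, k=2)
for node, pairs in similar.items():
    print(node, [name for name, _ in pairs])
```

Every node receives exactly k outgoing links here, even when its nearest candidates are not very similar, as with node "d" above.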

# Run kNN and write back to db (we skip memory estimation this time...)
result = gds.knn.write(
    G,
    topK=2,
    nodeProperties=["embedding"],
    randomSeed=42,
    concurrency=1,
    sampleRate=1.0,
    deltaThreshold=0.0,
    writeRelationshipType="SIMILAR",
    writeProperty="score",
)

print(f"Relationships produced: {result['relationshipsWritten']}")
print(f"Nodes compared: {result['nodesCompared']}")
print(f"Mean similarity: {result['similarityDistribution']['mean']}")

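As we can see, the mean similarity between nodes is quite high. This is because our example graph is small and has no long paths between nodes, so many nodes end up with similar FastRP embeddings.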

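7. Exploring the results

Let us now inspect the results of our kNN call using Cypher. We can use the SIMILAR relationship type to filter for the relationships we are interested in. Since we only care about similarities between people for our product recommendation engine, we make sure to match only nodes with the Person label.

See the Cypher manual for documentation on how to use Cypher.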

gds.run_cypher(
    """
        MATCH (p1:Person)-[r:SIMILAR]->(p2:Person)
        RETURN p1.name AS person1, p2.name AS person2, r.score AS similarity
        ORDER BY similarity DESCENDING, person1, person2
    """
)

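Our kNN results indicate that the Person nodes named "Annie" and "Matt" are very similar. Looking at the BUYS relationships for these two nodes, we can see that this conclusion makes sense. Both of them buy three products, two of which are the same (the Product nodes named "Cucumber" and "Tomatoes"), in similar amounts. This gives us confidence that the approach behaves sensibly on this example.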
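The overlap between the two baskets can be checked directly from the amounts in the example graph. The Jaccard overlap computed below is an independent sanity check chosen for this sketch. It is not the score that kNN wrote to the database.

```python
# Purchases of the two customers, copied from the example graph.
annie = {"Cucumber": 1.2, "Milk": 3.2, "Tomatoes": 3.2}
matt = {"Tomatoes": 3, "Kale": 2, "Cucumber": 1}

shared = sorted(set(annie) & set(matt))
jaccard = len(shared) / len(set(annie) | set(matt))

print(f"Shared products: {shared}")  # ['Cucumber', 'Tomatoes']
for product in shared:
    print(f"{product}: Annie {annie[product]}, Matt {matt[product]}")
print(f"Jaccard overlap of baskets: {jaccard}")  # 0.5
```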

8. Making recommendations

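Using the information we derived that the Person nodes named "Annie" and "Matt" are similar, we can make product recommendations for each of them. Since they are similar, we can assume that a product purchased by only one of them may also be of interest to the other, who does not yet buy it. By this principle, we can derive product recommendations for the Person named "Annie" using a simple Cypher query.

gds.run_cypher(
    """
        // Collect everything Annie already buys
        MATCH (:Person {name: "Annie"})-[:BUYS]->(p1:Product)
        WITH collect(p1) as products
        // Keep the products Matt buys that Annie does not
        MATCH (:Person {name: "Matt"})-[:BUYS]->(p2:Product)
        WHERE not p2 in products
        RETURN p2.name as recommendation
    """
)

Indeed, "Kale" is the one product that the Person named "Matt" buys that is not also purchased by the Person named "Annie".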
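The same set-difference idea can be checked in plain Python, in both directions, using the purchases from the example graph. The `recommend` helper is a hypothetical name for this sketch. It ignores purchase amounts and similarity scores.

```python
def recommend(buyer_basket, similar_basket):
    """Products the similar customer buys that the buyer does not yet buy."""
    return sorted(set(similar_basket) - set(buyer_basket))


# Baskets copied from the example graph.
annie = {"Cucumber", "Milk", "Tomatoes"}
matt = {"Tomatoes", "Kale", "Cucumber"}

for_annie = recommend(annie, matt)
for_matt = recommend(matt, annie)
print(f"Recommend to Annie: {for_annie}")  # ['Kale']
print(f"Recommend to Matt: {for_matt}")    # ['Milk']
```

Swapping the two names in the Cypher query above would give the recommendation for the other customer in the same way.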

9. Cleaning up

Before finishing, we can clean up the example data from both the GDS in-memory state and the database.

# Remove our projection from the GDS graph catalog
G.drop()

# Remove all the example data from the database
_ = gds.run_cypher("MATCH (n) DETACH DELETE n")

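10. Conclusion

Using two GDS algorithms and some basic Cypher, we were able to easily derive some sensible product recommendations for a customer in our small example.

To make sure that kNN gives us similarities to other customers for every customer in the graph, we could try increasing the topK parameter.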