Aura Graph Analytics for Non-Neo4j Data Sources

This Jupyter notebook is hosted in the Neo4j Graph Data Science Client GitHub repository.

The notebook shows how to use the graphdatascience Python library to create, manage, and use an Aura Graph Analytics (AGA) session.

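We use a small graph of people and fruits as a simple example of how to load data from Pandas DataFrames into an AGA session, run algorithms, and inspect the results. We cover all session management operations: creation, listing, and deletion.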

If you are using AuraDB, see the AuraDB example instead. If you are using a self-managed Neo4j instance, see the self-managed example instead.

1. Prerequisites

This notebook requires the Aura Graph Analytics feature to be enabled for your Neo4j Aura project.

You also need the graphdatascience Python library installed, version 1.15 or later.

%pip install "graphdatascience>=1.15" python-dotenv "neo4j_viz[gds]"

from dotenv import load_dotenv

# Load required secrets from a `.env` file in the local directory.
# This can include the Aura API credentials. If the file does not exist, this is a no-op.
load_dotenv(".env")

2. Aura API credentials

The entry point for managing GDS sessions is the GdsSessions object, which requires Aura API credentials. Create these credentials before continuing.

import os

from graphdatascience.session import AuraAPICredentials, GdsSessions

# you can also use AuraAPICredentials.from_env() to load credentials from environment variables
api_credentials = AuraAPICredentials(
    client_id=os.environ["CLIENT_ID"],
    client_secret=os.environ["CLIENT_SECRET"],
    # If your account is a member of several projects, you must also specify the project ID to use
    project_id=os.environ.get("PROJECT_ID", None),
)

sessions = GdsSessions(api_credentials=api_credentials)
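The credentials above are read with os.environ["CLIENT_ID"] and os.environ["CLIENT_SECRET"], which raise a bare KeyError if a variable is absent. As a minimal sketch, the check below reports missing settings with a clearer message before GdsSessions is constructed. The helper missing_env_vars is hypothetical and not part of the graphdatascience library.

```python
import os


# Hypothetical helper, not part of graphdatascience: return the names of the
# required variables that are missing or empty in the given mapping.
def missing_env_vars(required, env=None):
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# The two variables the credentials code above reads unconditionally.
REQUIRED = ["CLIENT_ID", "CLIENT_SECRET"]

missing = missing_env_vars(REQUIRED)
if missing:
    print(f"Missing Aura API settings: {', '.join(missing)}")
else:
    print("All required Aura API settings are present.")
```

PROJECT_ID is left out of the required list because the code above treats it as optional.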

3. Creating a new session

A new session is created by calling sessions.get_or_create() with the following parameters:

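A session name, which lets you reconnect to an existing session by calling get_or_create again.
The session memory size.
The cloud location.
A time-to-live (TTL), which ensures that the session is deleted automatically after being unused for the set time, to avoid incurring further costs.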

For more details on the parameters, see the API reference documentation or the manual.

from graphdatascience.session import AlgorithmCategory, CloudLocation, SessionMemory

# Estimate the memory needed for the GDS session
memory = sessions.estimate(
    node_count=20,
    relationship_count=50,
    algorithm_categories=[AlgorithmCategory.CENTRALITY, AlgorithmCategory.NODE_EMBEDDING],
)

print(f"Estimated memory: {memory}")

# Alternatively, define the size of the session explicitly (this replaces the estimate above)
memory = SessionMemory.m_2GB

# Specify your cloud location
cloud_location = CloudLocation("gcp", "europe-west1")

# You can find available cloud locations by calling
cloud_locations = sessions.available_cloud_locations()
print(f"Available locations: {cloud_locations}")

from datetime import timedelta

# Create a GDS session!
gds = sessions.get_or_create(
    # we give it a representative name
    session_name="people-and-fruits-standalone",
    memory=memory,
    ttl=timedelta(minutes=30),
    cloud_location=cloud_location,
)

# Verify the connectivity. Hints towards TLS or firewall issues if this fails directly after get_or_create
gds.verify_connectivity()

4. Listing sessions

You can use sessions.list() to see the details of each created session.

from pandas import DataFrame

gds_sessions = sessions.list()

# for better visualization
DataFrame(gds_sessions)

5. Projecting the dataset

An AGA session always starts empty, without any data. Our first step is therefore to project data into the session. In this example, we show how to do that using Pandas DataFrames.

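Many systems offer a way to read their data into Pandas DataFrames, which lets them act as data sources for AGA. To keep things simple, we define the DataFrames used in this notebook by hand.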

import pandas as pd

people_df = pd.DataFrame(
    [
        {"nodeId": 0, "name": "Dan", "age": 18, "experience": 63, "hipster": 0},
        {"nodeId": 1, "name": "Annie", "age": 12, "experience": 5, "hipster": 0},
        {"nodeId": 2, "name": "Matt", "age": 22, "experience": 42, "hipster": 0},
        {"nodeId": 3, "name": "Jeff", "age": 51, "experience": 12, "hipster": 0},
        {"nodeId": 4, "name": "Brie", "age": 31, "experience": 6, "hipster": 0},
        {"nodeId": 5, "name": "Elsa", "age": 65, "experience": 23, "hipster": 0},
        {"nodeId": 6, "name": "Bobby", "age": 38, "experience": 4, "hipster": 1},
        {"nodeId": 7, "name": "John", "age": 4, "experience": 100, "hipster": 0},
    ]
)
people_df["labels"] = "Person"

fruits_df = pd.DataFrame(
    [
        {"nodeId": 8, "name": "Apple", "tropical": 0, "sourness": 0.3, "sweetness": 0.6},
        {"nodeId": 9, "name": "Banana", "tropical": 1, "sourness": 0.1, "sweetness": 0.9},
        {"nodeId": 10, "name": "Mango", "tropical": 1, "sourness": 0.3, "sweetness": 1.0},
        {"nodeId": 11, "name": "Plum", "tropical": 0, "sourness": 0.5, "sweetness": 0.8},
    ]
)
fruits_df["labels"] = "Fruit"

like_relationships = [(0, 8), (1, 9), (2, 10), (3, 10), (4, 9), (5, 11), (7, 11)]
likes_df = pd.DataFrame([{"sourceNodeId": src, "targetNodeId": trg} for (src, trg) in like_relationships])
likes_df["relationshipType"] = "LIKES"

knows_relationship = [(0, 1), (0, 2), (1, 2), (1, 3), (1, 4), (2, 5), (7, 3)]
knows_df = pd.DataFrame([{"sourceNodeId": src, "targetNodeId": trg} for (src, trg) in knows_relationship])
knows_df["relationshipType"] = "KNOWS"
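When DataFrames come from an external system rather than being written by hand, a relationship may point at a node ID that no node frame defines. The sketch below checks for such dangling endpoints before the data is sent to the session. It assumes the column names used above (nodeId, sourceNodeId, targetNodeId). The helper dangling_endpoints and the small demo frames are hypothetical and only illustrate the idea.

```python
import pandas as pd


# Hypothetical helper: return the relationship rows whose source or target
# does not appear as a nodeId in any of the node frames.
def dangling_endpoints(node_frames, relationship_frames):
    known_ids = set(pd.concat(node_frames)["nodeId"])
    rels = pd.concat(relationship_frames, ignore_index=True)
    bad = ~rels["sourceNodeId"].isin(known_ids) | ~rels["targetNodeId"].isin(known_ids)
    return rels[bad]


# Tiny stand-in frames; node 9 is deliberately left undefined.
nodes_demo = [pd.DataFrame({"nodeId": [0, 1]}), pd.DataFrame({"nodeId": [8]})]
rels_demo = [
    pd.DataFrame(
        {"sourceNodeId": [0, 1], "targetNodeId": [8, 9], "relationshipType": "LIKES"}
    )
]

print(dangling_endpoints(nodes_demo, rels_demo))
```

With the notebook's own data, the equivalent call would be dangling_endpoints([people_df, fruits_df], [likes_df, knows_df]), which should return an empty frame because every endpoint is defined.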

6. Constructing the graph from DataFrames

With the DataFrames in hand, the next step is to construct a graph from them. We use the gds.graph.construct() function to do this.

The call returns a graph object that represents the graph now held in the AGA session. We pass it as input to the algorithms we run on the graph afterwards.

# Dropping `name` column as GDS does not support string properties
nodes = [people_df.drop(columns="name"), fruits_df.drop(columns="name")]
relationships = [likes_df, knows_df]

G = gds.graph.construct("people-fruits", nodes, relationships)
str(G)
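The name column is dropped by hand above because string properties are not supported. For wider frames, one possible alternative is to filter columns by dtype. The sketch below keeps the structural columns plus every scalar numeric column. The helper numeric_properties_only is hypothetical, and it would also drop list-valued columns, so treat it as a starting point rather than a general rule.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


# Hypothetical helper: keep the structural columns named in `keep` and any
# column with a numeric dtype; everything else (e.g. strings) is dropped.
def numeric_properties_only(df, keep=("nodeId", "labels")):
    cols = [c for c in df.columns if c in keep or is_numeric_dtype(df[c])]
    return df[cols]


# Stand-in frame with the same shape as a slice of people_df.
demo = pd.DataFrame({"nodeId": [0, 1], "name": ["Dan", "Annie"], "age": [18, 12]})
demo["labels"] = "Person"

print(list(numeric_properties_only(demo).columns))
# → ['nodeId', 'age', 'labels']
```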

# Let us visualize the projected graph
from neo4j_viz.gds import from_gds

VG = from_gds(gds, G)

# Concatenate the dataframes with name and nodeId columns
names = pd.concat([people_df[["nodeId", "name"]], fruits_df[["nodeId", "name"]]])
# Create a Series, indexed by nodeId, that maps each nodeId to its name
names_mapping = names.set_index("nodeId")["name"]

for node in VG.nodes:
    node.caption = names_mapping[node.id]

VG.render(initial_zoom=1.2)

7. Running algorithms

You can run algorithms on the constructed graph using the standard GDS Python Client API. See the other tutorials for more examples.

print("Running PageRank ...")
pr_result = gds.pageRank.mutate(G, mutateProperty="pagerank")
print(f"Compute millis: {pr_result['computeMillis']}")
print(f"Node properties written: {pr_result['nodePropertiesWritten']}")
print(f"Centrality distribution: {pr_result['centralityDistribution']}")

print("Running FastRP ...")
frp_result = gds.fastRP.mutate(
    G,
    mutateProperty="fastRP",
    embeddingDimension=8,
    featureProperties=["pagerank"],
    propertyRatio=0.2,
    nodeSelfInfluence=0.2,
)
print(f"Compute millis: {frp_result['computeMillis']}")
# stream back the results
result = gds.graph.nodeProperties.stream(G, ["pagerank", "fastRP"], separate_property_columns=True)

result

To resolve each nodeId to a name, we can merge the result with the source DataFrames.

names = pd.concat([people_df, fruits_df])[["nodeId", "name"]]
result.merge(names, how="left")

8. Deleting the session

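Once the analysis is finished, you can delete the session. Because this example is not connected to a Neo4j database, you must make sure yourself that the algorithm results are persisted.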

Deleting the session releases all resources associated with it and stops further costs from accruing.
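Since nothing is written back to a database here, one simple way to persist results is to save the streamed DataFrame to a file before the session is deleted. The sketch below writes a frame to CSV and reads it back. The helper save_results, the file name, and the demo values are hypothetical placeholders, not actual algorithm output.

```python
import tempfile
from pathlib import Path

import pandas as pd


# Hypothetical helper: write streamed node properties to a CSV file and
# return the path that was written.
def save_results(result_df, path):
    path = Path(path)
    result_df.to_csv(path, index=False)
    return path


# Stand-in for the streamed `result` frame; the scores are placeholders.
demo_result = pd.DataFrame({"nodeId": [0, 1], "pagerank": [0.5, 0.25]})

with tempfile.TemporaryDirectory() as tmp:
    out = save_results(demo_result, Path(tmp) / "people_fruits_results.csv")
    restored = pd.read_csv(out)

print(restored)
```

CSV stores list-valued columns such as embeddings as plain strings, so a format like Parquet or JSON may be a better fit for the fastRP column.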

# or gds.delete()
sessions.delete(session_name="people-and-fruits-standalone")

# let's also make sure the deleted session is truly gone:
sessions.list()