Aura Graph Analytics for non-Neo4j data sources

This Jupyter notebook is hosted here in the Neo4j Graph Data Science Client GitHub repository.

The notebook shows how to use the graphdatascience Python library to create, manage, and use an Aura Graph Analytics (AGA) session.

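We use a graph of people and fruits as a simple example to show how to load data from Pandas DataFrames into an AGA session, run algorithms, and inspect the results. We cover all management operations: creation, listing, and deletion.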

If you are using AuraDB, follow this example. If you are using a self-managed Neo4j instance, follow this example.

1. Prerequisites

This notebook requires the Aura Graph Analytics feature to be enabled for your Neo4j Aura project.

You also need to have the graphdatascience Python library installed, version 1.15 or later.

%pip install "graphdatascience>=1.15" python-dotenv "neo4j_viz[gds]"

from dotenv import load_dotenv

# Load required secrets from a `.env` file in the local directory.
# These can include the Aura API credentials. If the file does not exist, this is a no-op.
load_dotenv(".env")

2. Aura API credentials

The entry point for managing GDS sessions is the GdsSessions object, which requires Aura API credentials to be created.

import os

from graphdatascience.session import AuraAPICredentials, GdsSessions

# you can also use AuraAPICredentials.from_env() to load credentials from environment variables
api_credentials = AuraAPICredentials(
    client_id=os.environ["CLIENT_ID"],
    client_secret=os.environ["CLIENT_SECRET"],
    # If your account is a member of several projects, you must also specify the project ID to use
    project_id=os.environ.get("PROJECT_ID", None),
)

sessions = GdsSessions(api_credentials=api_credentials)
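
The `os.environ["CLIENT_ID"]` lookup above raises a bare KeyError when a variable is not set. As a minimal sketch, a check like the following could run before the credentials are built. The helper name `missing_env_vars` is hypothetical and not part of graphdatascience; the variable names are the ones used in this notebook.

```python
import os
from typing import Mapping, Sequence


def missing_env_vars(required: Sequence[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# CLIENT_ID and CLIENT_SECRET are required by the cell above;
# PROJECT_ID is optional there, so it is not checked here.
missing = missing_env_vars(["CLIENT_ID", "CLIENT_SECRET"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.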

3. Creating a new session

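A new session is created by calling sessions.get_or_create() with the following parameters:

  • A session name, which lets you reconnect to an existing session by calling get_or_create again.

  • The session memory.

  • The cloud location.

  • A time-to-live (TTL), which ensures that the session is automatically deleted after being unused for the set time, to avoid incurring costs.

For more details on the parameters, see the API reference documentation or the manual.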

from graphdatascience.session import AlgorithmCategory, CloudLocation, SessionMemory

# Estimate the memory needed for the GDS session
memory = sessions.estimate(
    node_count=20,
    relationship_count=50,
    algorithm_categories=[AlgorithmCategory.CENTRALITY, AlgorithmCategory.NODE_EMBEDDING],
)

print(f"Estimated memory: {memory}")

# Explicitly define the size of the session
memory = SessionMemory.m_2GB

# Specify your cloud location
cloud_location = CloudLocation("gcp", "europe-west1")

# You can find available cloud locations by calling
cloud_locations = sessions.available_cloud_locations()
print(f"Available locations: {cloud_locations}")

from datetime import timedelta

# Create a GDS session!
gds = sessions.get_or_create(
    # we give it a representative name
    session_name="people-and-fruits-standalone",
    memory=memory,
    ttl=timedelta(minutes=30),
    cloud_location=cloud_location,
)

# Verify connectivity. A failure directly after get_or_create hints at TLS or firewall issues.
gds.verify_connectivity()

4. Listing sessions

You can use sessions.list() to see the details of each created session.

from pandas import DataFrame

gds_sessions = sessions.list()

# for better visualization
DataFrame(gds_sessions)

5. Projecting the dataset

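An AGA session always starts empty, with no data. Our first step is therefore to project data into the session. In this example, we show how to do this with Pandas DataFrames.

Many systems offer ways to read data into Pandas DataFrames, which lets them serve as data sources for AGA. For simplicity, we define the DataFrames used in this notebook by hand.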

import pandas as pd

people_df = pd.DataFrame(
    [
        {"nodeId": 0, "name": "Dan", "age": 18, "experience": 63, "hipster": 0},
        {"nodeId": 1, "name": "Annie", "age": 12, "experience": 5, "hipster": 0},
        {"nodeId": 2, "name": "Matt", "age": 22, "experience": 42, "hipster": 0},
        {"nodeId": 3, "name": "Jeff", "age": 51, "experience": 12, "hipster": 0},
        {"nodeId": 4, "name": "Brie", "age": 31, "experience": 6, "hipster": 0},
        {"nodeId": 5, "name": "Elsa", "age": 65, "experience": 23, "hipster": 0},
        {"nodeId": 6, "name": "Bobby", "age": 38, "experience": 4, "hipster": 1},
        {"nodeId": 7, "name": "John", "age": 4, "experience": 100, "hipster": 0},
    ]
)
people_df["labels"] = "Person"

fruits_df = pd.DataFrame(
    [
        {"nodeId": 8, "name": "Apple", "tropical": 0, "sourness": 0.3, "sweetness": 0.6},
        {"nodeId": 9, "name": "Banana", "tropical": 1, "sourness": 0.1, "sweetness": 0.9},
        {"nodeId": 10, "name": "Mango", "tropical": 1, "sourness": 0.3, "sweetness": 1.0},
        {"nodeId": 11, "name": "Plum", "tropical": 0, "sourness": 0.5, "sweetness": 0.8},
    ]
)
fruits_df["labels"] = "Fruit"

like_relationships = [(0, 8), (1, 9), (2, 10), (3, 10), (4, 9), (5, 11), (7, 11)]
likes_df = pd.DataFrame([{"sourceNodeId": src, "targetNodeId": trg} for (src, trg) in like_relationships])
likes_df["relationshipType"] = "LIKES"

knows_relationship = [(0, 1), (0, 2), (1, 2), (1, 3), (1, 4), (2, 5), (7, 3)]
knows_df = pd.DataFrame([{"sourceNodeId": src, "targetNodeId": trg} for (src, trg) in knows_relationship])
knows_df["relationshipType"] = "KNOWS"
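
When the DataFrames come from an external system rather than being typed in by hand, it can help to confirm that every relationship endpoint refers to a known node before building the graph. The following is a minimal sketch of such a check; the helper name `dangling_endpoints` is hypothetical, and the stand-in frames only reuse the column names from the cell above.

```python
import pandas as pd


def dangling_endpoints(node_dfs, relationship_dfs):
    """Return the sorted node IDs used by relationships but absent from the node frames."""
    known = {int(node_id) for node_id in pd.concat(node_dfs)["nodeId"]}
    referenced = set()
    for rel_df in relationship_dfs:
        referenced |= {int(node_id) for node_id in rel_df["sourceNodeId"]}
        referenced |= {int(node_id) for node_id in rel_df["targetNodeId"]}
    return sorted(referenced - known)


# Tiny stand-in frames: node 9 is referenced as a target but never defined
example_nodes = [pd.DataFrame({"nodeId": [0, 1]}), pd.DataFrame({"nodeId": [8]})]
example_rels = [pd.DataFrame({"sourceNodeId": [0, 1], "targetNodeId": [8, 9]})]

print(dangling_endpoints(example_nodes, example_rels))  # prints [9]
```

Called with the notebook's own frames, for example `dangling_endpoints([people_df, fruits_df], [likes_df, knows_df])`, an empty list means all endpoints are covered.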

6. Constructing the graph from DataFrames

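With the DataFrames in hand, the next step is to construct a graph from them. We do this with the gds.graph.construct() function.

The call returns a graph object that represents the graph now residing in the AGA session. We pass it as input to the algorithms we run on the graph afterwards.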

# Dropping `name` column as GDS does not support string properties
nodes = [people_df.drop(columns="name"), fruits_df.drop(columns="name")]
relationships = [likes_df, knows_df]

G = gds.graph.construct("people-fruits", nodes, relationships)
str(G)

# Let us visualize the projected graph
from neo4j_viz.gds import from_gds

VG = from_gds(gds, G)

# Concatenate the dataframes with name and nodeId columns
names = pd.concat([people_df[["nodeId", "name"]], fruits_df[["nodeId", "name"]]])
# Create a dictionary mapping nodeId to name
names_mapping = names.set_index("nodeId")["name"]

for node in VG.nodes:
    node.caption = names_mapping[node.id]

VG.render(initial_zoom=1.2)
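
The construction step in this section drops the `name` column by hand because GDS does not support string properties. For frames with many columns, a more general filter may be convenient. The sketch below keeps the structural columns used in this notebook plus any numeric property column; the helper `keep_numeric_properties` and the `STRUCTURAL_COLUMNS` set are hypothetical names. Note that it would also drop list-valued columns, which may or may not be what you want.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Columns that carry graph structure rather than properties in this notebook's frames
STRUCTURAL_COLUMNS = {"nodeId", "labels", "sourceNodeId", "targetNodeId", "relationshipType"}


def keep_numeric_properties(df: pd.DataFrame) -> pd.DataFrame:
    """Keep structural columns and numeric property columns; drop everything else."""
    keep = [col for col in df.columns if col in STRUCTURAL_COLUMNS or is_numeric_dtype(df[col])]
    return df[keep]


example = pd.DataFrame({"nodeId": [0], "name": ["Dan"], "age": [18], "labels": ["Person"]})
print(list(keep_numeric_properties(example).columns))  # prints ['nodeId', 'age', 'labels']
```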

7. Running algorithms

You can run algorithms on the constructed graph using the standard GDS Python Client API. See the other tutorials for more examples.

print("Running PageRank ...")
pr_result = gds.pageRank.mutate(G, mutateProperty="pagerank")
print(f"Compute millis: {pr_result['computeMillis']}")
print(f"Node properties written: {pr_result['nodePropertiesWritten']}")
print(f"Centrality distribution: {pr_result['centralityDistribution']}")

print("Running FastRP ...")
frp_result = gds.fastRP.mutate(
    G,
    mutateProperty="fastRP",
    embeddingDimension=8,
    featureProperties=["pagerank"],
    propertyRatio=0.2,
    nodeSelfInfluence=0.2,
)
print(f"Compute millis: {frp_result['computeMillis']}")

# stream back the results
result = gds.graph.nodeProperties.stream(G, ["pagerank", "fastRP"], separate_property_columns=True)

result

To resolve each nodeId to a name, we can merge the result with the source DataFrames.

names = pd.concat([people_df, fruits_df])[["nodeId", "name"]]
result.merge(names, how="left")
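
The merge above relies on pandas picking the shared nodeId column as the join key. A slightly more defensive variant, sketched here on small stand-in frames, names the key explicitly and asks pandas to verify that each nodeId appears at most once on either side.

```python
import pandas as pd

# Stand-ins for the streamed result and the name lookup table
scores = pd.DataFrame({"nodeId": [0, 8], "pagerank": [0.5, 1.5]})
names = pd.DataFrame({"nodeId": [0, 8], "name": ["Dan", "Apple"]})

# `on` names the join key explicitly; `validate` raises pandas.errors.MergeError
# if either side contains duplicate nodeId values.
named = scores.merge(names, on="nodeId", how="left", validate="one_to_one")
print(named)
```

With `how="left"`, a node that has no entry in the lookup table keeps its row and gets a missing value in the `name` column.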

8. Deleting the session

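Once the analysis is complete, you can delete the session. Because this example is not connected to a Neo4j database, you need to make sure yourself that the algorithm results are persisted.

Deleting the session releases all resources associated with it and stops incurring costs.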
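
One way to persist results before the session goes away is to write the streamed DataFrame to a local file. The sketch below uses a small stand-in frame shaped like the streamed result from section 7; the file name and the choice of JSON Lines are assumptions, not something the notebook prescribes.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the streamed result: one scalar and one list-valued property
example_result = pd.DataFrame(
    {"nodeId": [0, 1], "pagerank": [0.5, 0.25], "fastRP": [[0.5, -0.5], [0.25, 0.75]]}
)

# Hypothetical output location; in practice pick a durable path
out_path = Path(tempfile.mkdtemp()) / "people-fruits-results.jsonl"

# JSON Lines keeps the embedding lists intact; a CSV would turn them into strings
example_result.to_json(out_path, orient="records", lines=True)

# Read the file back to confirm that it round-trips
restored = pd.read_json(out_path, lines=True)
print(restored)
```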

# or gds.delete()
sessions.delete(session_name="people-and-fruits-standalone")
# let's also make sure the deleted session is truly gone:
sessions.list()
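
The TTL acts as a safety net, but a session can also be deleted promptly when a script fails halfway. As a minimal sketch, a context manager can call delete() on exit regardless of errors. The helper `deleted_on_exit` and the `FakeSession` class are hypothetical and used for illustration only; a real session object would come from sessions.get_or_create().

```python
from contextlib import contextmanager


@contextmanager
def deleted_on_exit(session):
    """Yield `session` and call its `delete()` method on exit, even after an exception."""
    try:
        yield session
    finally:
        session.delete()


class FakeSession:
    """Stand-in for illustration only; it just records whether delete() was called."""

    def __init__(self):
        self.deleted = False

    def delete(self):
        self.deleted = True


fake = FakeSession()
with deleted_on_exit(fake) as session:
    pass  # build the graph, run algorithms, and persist results here
print(fake.deleted)  # prints True
```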