中心性算法

此 Jupyter Notebook 托管于 Neo4j 图数据科学客户端 Github 仓库的此处。

中心性算法用于理解图中特定节点的作用或影响力。本 Notebook 展示了如何使用 graphdatascience 库在航空旅行可达性网络数据集上应用中心性算法，该数据集可从此处下载。

本 Notebook 将展示如何对图数据集应用特征向量中心性（Eigenvector Centrality）、中介中心性（Betweenness Centrality）、度中心性（Degree Centrality）和接近中心性（Closeness Centrality）。

1. 设置

首先，我们要导入依赖项并建立 GDS 客户端与数据库的连接。

此外，您可以使用 Aura Graph Analytics，并跳过下方整个“设置”部分。

# Install necessary dependencies
%pip install graphdatascience pandas

import os

import pandas as pd

from graphdatascience import GraphDataScience

NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://:7687")
NEO4J_AUTH = None
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )

gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)

from graphdatascience import ServerVersion

assert gds.server_version() >= ServerVersion(1, 8, 0)

2. 导入数据集

我们首先将数据集导入为 pandas DataFrame。此处处理两个文件：reachability-meta.csv.gz 文件存储了城市名称及其相关信息，而 reachability.txt.gz 文件存储了图的边。如果预估航空旅行时间小于某个阈值，则城市 i 到城市 j 之间存在一条边。

nodes_info_df = pd.read_csv("https://snap.stanford.edu/data/reachability-meta.csv.gz", compression="gzip")
nodes_info_df.head()

routes_df = pd.read_csv(
    "https://snap.stanford.edu/data/reachability.txt.gz",
    sep=" ",
    skiprows=6,
    header=None,
    compression="gzip",
    names=["Origin", "Destination", "Weight"],
)
routes_df.head()

由于该图非常小，直接使用 Cypher UNWIND 查询是在数据库中创建图的最简单方法。

对于较大的图，可能需要更复杂的导入技术，如分批处理、neo4j-admin import 或 Arrow CREATE DATABASE。

gds.run_cypher(
    "UNWIND $nodes AS node CREATE (n:City {node_id: node.node_id, name: node.name, population: node.metro_pop})",
    params={"nodes": nodes_info_df.to_dict("records")},
)

gds.run_cypher(
    """
    UNWIND $rels AS rel
    MATCH (source:City {node_id: rel.Origin}), (target:City {node_id: rel.Destination})
    CREATE (source)-[:HAS_FLIGHT_TO]->(target)
    """,
    params={"rels": routes_df.to_dict("records")},
)

G, result = gds.graph.project("airline", "City", "HAS_FLIGHT_TO")

print(f"The projection took {result['projectMillis']} ms")

# We can use convenience methods on `G` to check if the projection looks correct
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")
print(f"Graph '{G.name()}' relationship count: {G.relationship_count()}")

3. 特征向量中心性

特征向量中心性基于节点与网络中其他节点的连接程度来衡量其重要性或影响力。较高的特征向量中心性得分表明该节点在网络中更具中心地位且影响力更大。

对于我们的数据集，特征向量中心性可以帮助识别那些不仅自身连接良好，而且还连接到其他重要机场的机场。具有高特征向量中心性的节点很可能是主要枢纽或具有广泛连接性的机场。

eigenvector_centrality_result = gds.eigenvector.mutate(G, maxIterations=100, mutateProperty="eigenvectorCentrality")

# We can verify that the eigenvectorCentrality was mutated
G.node_properties()

我们可以使用下面的代码查看我们的实现是否收敛，如果收敛，还可以查看它所花费的迭代次数。

if eigenvector_centrality_result.didConverge:
    print(
        f"The number of iterations taken by Eigenvector Centrality to run is {eigenvector_centrality_result.ranIterations}."
    )
else:
    print("Algorithm did not converge!")

我们还可以使用下面的代码查看特征向量中心性度量的分布情况。这将向我们展示中心性度量的最小值、最大值、平均值和其他统计值。

eigenvector_centrality_result.centralityDistribution

现在我们将结果写回数据库。

gds.graph.nodeProperties.write(G, ["eigenvectorCentrality"])

利用特征向量中心性的结果，我们现在可以查找排名前 20 位的城市，这些城市拥有主要枢纽或具有广泛连接性的机场。

def display_top_20_cities(centrality_measure):
    """
    Function to execute the Cypher query to retrieve the top 20 cities with the highest centrality measure.
    """
    query = f"""
    MATCH (n:City)
    RETURN n.node_id AS node_id, n.name AS name, n.population AS population, n.{centrality_measure} AS {centrality_measure}
    ORDER BY n.{centrality_measure} DESC
    LIMIT 20
    """
    result = gds.run_cypher(query)

    # Display the result
    print(result)


display_top_20_cities("eigenvectorCentrality")

4. 中介中心性

中介中心性量化了节点作为网络中桥梁或中介的重要性。它衡量了一个节点出现在其他节点对之间的最短路径上的频率。

对于我们的数据集，具有高中介中心性的城市/机场充当了重要中转点或连接枢纽，连接了那些可能没有直飞航班的机场。它们在促进航空旅行流量方面发挥着重要作用，对于整体网络连接性至关重要。

betweenness_centrality_result = gds.betweenness.mutate(G, mutateProperty="betweennessCentrality")

# We can verify that the betweennessCentrality was mutated
G.node_properties()

我们同样可以使用下面的代码查看中介中心性度量的分布情况。这将向我们展示中心性度量的最小值、最大值、平均值和其他统计值。

betweenness_centrality_result.centralityDistribution

现在我们将结果写回数据库。

gds.graph.nodeProperties.write(G, ["betweennessCentrality"])

利用中介中心性的结果，我们现在可以查找排名前 20 位的城市，这些城市拥有的机场充当了重要中转点或连接枢纽，连接了那些可能没有直飞航班的机场。

display_top_20_cities("betweennessCentrality")

5. 度中心性

度中心性衡量一个节点在网络中拥有的连接数（边数）。

对于我们的数据集，具有高度中心性的城市拥有大量通往其他城市的直飞航班连接。它们代表了拥有许多直飞目的地或经常用于直航旅行的城市。度中心性提供了关于网络中各个机场的突出程度和连接性的见解。

degree_centrality_result = gds.degree.mutate(G, mutateProperty="degreeCentrality")

# We can verify that the degreeCentrality was mutated
G.node_properties()

与上述类似，我们也可以使用下面的代码查看度中心性度量的分布情况。这将向我们展示中心性度量的最小值、最大值、平均值和其他统计值。

degree_centrality_result.centralityDistribution

现在我们将结果写回数据库。

gds.graph.nodeProperties.write(G, ["degreeCentrality"])

最后，利用度中心性的结果，我们现在可以查找排名前 20 位的城市，这些城市拥有大量直飞航班。

display_top_20_cities("degreeCentrality")

6. 清理

在结束之前，我们可以从 GDS 内存状态和数据库中清理示例数据。

# Cleanup GDS
G.drop()

# Cleanup database
gds.run_cypher("MATCH (n:City) DETACH DELETE n")

7. 参考文献

关于网络：Brendan J. Frey 和 Delbert Dueck，“Clustering by passing messages between data points.” Science 315.5814 (2007): 972-976.
关于城市元数据（大都市人口、纬度和经度）：Austin R. Benson, David F. Gleich, 和 Jure Leskovec，“Higher-order Organization of Complex Networks.” Science, 353.6295 (2016): 163–166.
数据集链接：https://snap.stanford.edu/data/reachability.html
本 Notebook 由 Kedar Ghule 贡献