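Entity Resolution
2. Introduction
As discussed earlier, entity resolution is a key part of any data project, whatever type of data is being analysed. This includes: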
- Customers
- Transactions
- Products
- Orders
- Addresses
- Policies
- Product applications
- and many more
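Whenever people have to type information into a free-text field, inconsistent data can follow. This guide demonstrates how knowledge graphs are uniquely positioned to help solve this problem. The example focuses on deduplicating addresses, but the same principles apply to any other entity in your organisation.
3. Modelling
This section shows example Cypher queries on a sample graph. The aim is to illustrate the structure of the queries and to give guidance on how to structure your data in a real-world setting. The demonstration uses a graph with only a few nodes. The sample graph is based on the following data model:
3.2. Demo data
The following Cypher statements create the sample graph in a Neo4j database: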
// Create all Address Nodes
CREATE (:Address {`RegAddressAddressLine1`: "37 ALBYN PLACE", `RegAddressAddressLine2`: "ALBYN PLACE", RegAddressPostTown: "ABERDEEN", RegAddressPostCode: "AB101JB", FullAddress: "37 ALBYN PLACE ALBYN PLACE ABERDEEN AB101JB"})
CREATE (:Address {`RegAddressAddressLine1`: "COMPANY NAME", `RegAddressAddressLine2`: "37 ALBYN PLACE", RegAddressPostTown: "ABERDEEN", RegAddressPostCode: "AB101JB", FullAddress: "COMPANY NAME 37 ALBYN PLACE ABERDEEN AB101JB"});
// Update each Address Node with longitude and latitude
MATCH (a:Address)
CALL apoc.spatial.geocode(a.RegAddressPostCode) YIELD location
SET a.Latitude = location.latitude,
a.Longitude = location.longitude;
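Geocoding relies on an external provider configured for APOC, so a postcode may come back without a result, in which case the node keeps no coordinates. A quick check such as the following (a sketch that uses only the properties set above; the MissingCoordinates alias is introduced here for illustration) lists each address with its coordinates and flags any that are missing:

```cypher
// List every Address with its coordinates and flag nodes the geocoder did not resolve
MATCH (a:Address)
RETURN a.FullAddress AS FullAddress,
       a.Latitude AS Latitude,
       a.Longitude AS Longitude,
       (a.Latitude IS NULL OR a.Longitude IS NULL) AS MissingCoordinates;
```

Any row with MissingCoordinates set to true will produce a null distance in the queries that follow.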
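4. Cypher queries
4.1. Calculating the distance between addresses (in metres)
This Cypher query calculates the distance between Address nodes from their geographic coordinates (latitude and longitude). Two features stand out: it uses the point.distance function to compute the distance directly in the query, and it uses ID(a1) > ID(a2) so that each pair is compared only once.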
// Calculate the distance between Address Nodes
MATCH (a1:Address), (a2:Address)
WHERE ID(a1) > ID(a2)
RETURN a1.FullAddress AS FullAddress1, a2.FullAddress AS FullAddress2,
point.distance(point({ latitude: a1.Latitude, longitude: a1.Longitude }),
point({ latitude: a2.Latitude, longitude: a2.Longitude })) AS DistanceInMeters
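4.1.1. What does the query do?
- MATCH (a1:Address), (a2:Address): matches all nodes with the Address label, using two separate variables, a1 and a2, to refer to them.
- WHERE ID(a1) > ID(a2): ensures that an address is never compared with itself and, by ordering on Neo4j's internal IDs, that each pair of distinct nodes is compared only once.
- RETURN a1.FullAddress AS FullAddress1, a2.FullAddress AS FullAddress2: returns the full addresses of the two nodes being compared, aliased as FullAddress1 and FullAddress2 for readability.
- point.distance(point({ latitude: a1.Latitude, longitude: a1.Longitude }), point({ latitude: a2.Latitude, longitude: a2.Longitude })) AS DistanceInMeters: the core of the query, which calculates the geographic distance between the two address nodes.
  - point({ latitude: a1.Latitude, longitude: a1.Longitude }) builds a point from a1's latitude and longitude.
  - point({ latitude: a2.Latitude, longitude: a2.Longitude }) does the same for a2.
  - point.distance() then calculates the distance between the two points in metres.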
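As a possible follow-up, the same distance expression can be used to keep only pairs that lie close together. The sketch below assumes the Latitude and Longitude properties were set by the geocoding step in 3.2; the 100-metre cut-off is an illustrative assumption, not a value taken from this guide.

```cypher
// Keep only Address pairs within an assumed 100 m of each other
MATCH (a1:Address), (a2:Address)
WHERE id(a1) > id(a2)
WITH a1, a2,
     point.distance(
       point({latitude: a1.Latitude, longitude: a1.Longitude}),
       point({latitude: a2.Latitude, longitude: a2.Longitude})
     ) AS DistanceInMeters
WHERE DistanceInMeters < 100   // hypothetical threshold; tune it for your data
RETURN a1.FullAddress AS FullAddress1,
       a2.FullAddress AS FullAddress2,
       DistanceInMeters
ORDER BY DistanceInMeters;
```

Pairs with missing coordinates drop out of the result, because a null distance does not satisfy the comparison.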
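4.2. Address node similarity scoring
This more complex Cypher query calculates similarity scores between Address nodes from several properties, such as the address lines and the postcode. It uses the apoc.cypher.mapParallel2 procedure from the APOC (Awesome Procedures On Cypher) library to run the scoring in parallel, which improves performance. Levenshtein similarity measures how alike two strings are, which allows a fine-grained comparison of the address fields. The query also applies several layers of selection logic to keep only high-quality matches.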
// Parallel Similarity Scoring Version
MATCH (a:Address)
WITH COLLECT(DISTINCT(left(a.RegAddressPostCode, 3))) AS postcodes
CALL apoc.cypher.mapParallel2("
MATCH (a:Address), (b:Address)
WHERE id(a) > id(b) AND a.RegAddressPostCode STARTS WITH _ AND b.RegAddressPostCode STARTS WITH _
// Pass Variables
WITH a, b,
// Build similarity scores
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine1) AS line_1_sim,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine2) AS line_2_sim,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine2) AS a_b_line_1,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine1) AS b_a_line_1,
apoc.text.levenshteinSimilarity(a.RegAddressPostCode, b.RegAddressPostCode) AS post_sim,
apoc.text.levenshteinSimilarity(a.FullAddress, b.FullAddress) AS full_address_sim
WITH a, b, line_1_sim, line_2_sim, a_b_line_1, b_a_line_1, post_sim, full_address_sim, ((line_1_sim + line_2_sim) / 2) as add_1_2_calculation
// Selection logic //
// Limit the similarity of the full address
WHERE full_address_sim > 0.6
// Postcodes can not be too far apart
AND post_sim > 0.7
// Looks at addresses that have prefixes, e.g. 37 ALBYN PLACE vs COMPANY NAME 37 ALBYN PLACE
// This addition pushes the address into Line 2
AND ((line_1_sim = 1 OR a_b_line_1 = 1 OR b_a_line_1 = 1) AND post_sim > 0.85)
AND NOT (add_1_2_calculation > 0.6 AND full_address_sim > 0.91 AND post_sim > 0.9)
RETURN id(a) as a_id, a.FullAddress as a_FullAddress,id(b) as b_id, b.FullAddress as b_FullAddress, full_address_sim;
",
{parallel:True, batchSize:1000, concurrency:6}, postcodes, 6) YIELD value
RETURN value.a_id AS a_id, value.a_FullAddress AS a_full_address, value.b_id AS b_id, value.b_FullAddress AS b_full_address, value.full_address_sim AS full_address_similarity;
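4.2.1. What does the query do?
- MATCH (a:Address): starts the query by matching all nodes with the Address label.
- WITH COLLECT(DISTINCT(left(a.RegAddressPostCode, 3))) AS postcodes: collects the distinct three-character postcode prefixes into a list named postcodes.
- CALL apoc.cypher.mapParallel2("…", {parallel:True, batchSize:1000, concurrency:6}, postcodes, 6) YIELD value: runs the nested Cypher query in parallel, once per entry in postcodes. The map requests parallel execution with a batch size of 1000 and a concurrency of 6, and the final 6 is the procedure's partitions argument.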
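To see what each parallel invocation works on, the blocking step can be reproduced without APOC. This sketch replaces the `_` placeholder with an explicit `prefix` variable (a name introduced here for illustration) and only counts the candidate pairs per postcode prefix, so it is a way to inspect the blocking rather than a replacement for the parallel query.

```cypher
// Count candidate pairs per three-character postcode prefix (sequential, no APOC)
MATCH (a:Address)
WITH collect(DISTINCT left(a.RegAddressPostCode, 3)) AS postcodes
UNWIND postcodes AS prefix
MATCH (a:Address), (b:Address)
WHERE id(a) > id(b)
  AND a.RegAddressPostCode STARTS WITH prefix
  AND b.RegAddressPostCode STARTS WITH prefix
RETURN prefix, count(*) AS candidate_pairs
ORDER BY candidate_pairs DESC;
```

Prefixes with very large counts are the ones that dominate the runtime, since the number of pairs grows quadratically with the number of addresses sharing a prefix.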
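Nested query details
- MATCH (a:Address), (b:Address): matches all pairs of Address nodes for comparison.
- WHERE id(a) > id(b) AND a.RegAddressPostCode STARTS WITH _ AND b.RegAddressPostCode STARTS WITH _: ensures that each pair is considered only once and that both postcodes start with the current prefix from the postcodes list, which the fragment refers to as _.
- Levenshtein similarity calculation: uses apoc.text.levenshteinSimilarity to score the similarity between various properties of a and b.
- Selection logic: applies several conditions to filter the results. For example, the full addresses must be sufficiently similar (full_address_sim > 0.6), as must the postcodes (post_sim > 0.7).
- RETURN id(a) as a_id, a.FullAddress as a_FullAddress, id(b) as b_id, b.FullAddress as b_FullAddress, full_address_sim;: returns the IDs and full addresses of a and b, together with the full-address similarity score.
By combining Levenshtein text similarity with layered selection logic, the query is well suited to capturing subtle relationships between addresses.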
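The individual scores are easier to reason about when computed on literal values. The following sketch feeds the address lines of the two demo nodes from 3.2 into apoc.text.levenshteinSimilarity; the variable names line1_a, line2_a, line1_b and line2_b are introduced here for illustration only.

```cypher
// Score the demo address lines directly, without touching the graph
WITH "37 ALBYN PLACE" AS line1_a, "ALBYN PLACE"    AS line2_a,
     "COMPANY NAME"   AS line1_b, "37 ALBYN PLACE" AS line2_b
RETURN apoc.text.levenshteinSimilarity(line1_a, line1_b) AS line_1_sim,
       apoc.text.levenshteinSimilarity(line2_a, line2_b) AS line_2_sim,
       apoc.text.levenshteinSimilarity(line1_a, line2_b) AS a_b_line_1,
       apoc.text.levenshteinSimilarity(line2_a, line1_b) AS b_a_line_1;
```

Here line1_a and line2_b are the same string, so a_b_line_1 should come out as 1. That is the case the cross-line comparisons in the selection logic are designed to catch: an address pushed from Line 1 into Line 2 by a prefix such as a company name.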
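4.3. Creating similarity relationships between address nodes
This Cypher query creates relationships of type SIMILAR_ADDRESS between Address nodes, based on several similarity scores calculated with the Levenshtein algorithm. The scores are computed with the apoc.text.levenshteinSimilarity function from the APOC library, and selection logic filters out pairs that do not meet the similarity criteria. The query specifically targets addresses that share a common prefix or differ only slightly in their address lines.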
// Create Similarity Relationship
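MATCH (a:Address), (b:Address)
// Compare each pair once, as in 4.2; otherwise every match is merged in both directions
WHERE id(a) > id(b)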
// Pass Variables
WITH a, b,
// Build similarity scores
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine1) AS line_1_sim,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine2) AS line_2_sim,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine2) AS a_b_line_1,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine1) AS b_a_line_1,
apoc.text.levenshteinSimilarity(a.RegAddressPostCode, b.RegAddressPostCode) AS post_sim,
apoc.text.levenshteinSimilarity(a.FullAddress, b.FullAddress) AS full_address_sim
WITH a, b, line_1_sim, line_2_sim, a_b_line_1, b_a_line_1, post_sim, full_address_sim, ((line_1_sim + line_2_sim) / 2) as add_1_2_calculation
// Selection logic
// Limit the similarity of the full address
WHERE full_address_sim > 0.6
// Postcodes can not be too far apart
AND post_sim > 0.7
// Looks at addresses that have prefixes, e.g. 37 ALBYN PLACE vs COMPANY NAME 37 ALBYN PLACE
// This addition pushes the address into Line 2
AND ((line_1_sim = 1 OR a_b_line_1 = 1 OR b_a_line_1 = 1) AND post_sim > 0.85)
AND NOT (add_1_2_calculation > 0.6 AND full_address_sim > 0.91 AND post_sim > 0.9)
MERGE (a)-[:SIMILAR_ADDRESS {
full_address_similarity: full_address_sim,
postcode_similarity: post_sim,
line_2_similarity: line_2_sim,
line_1_similarity: line_1_sim,
line_1_2_similarity: a_b_line_1,
line_2_1_similarity: b_a_line_1
}]->(b);
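4.3.1. What does the query do?
- MATCH (a:Address), (b:Address): the query starts by matching nodes with the Address label, referred to by the variables a and b.
- WITH a, b, …: passes the matched a and b nodes, together with several calculated similarity scores, on to the rest of the query.
- Levenshtein similarity calculation: uses apoc.text.levenshteinSimilarity to score the similarity between properties of a and b, such as the address lines and the postcode.
- WITH a, b, line_1_sim, …: keeps the original nodes and the calculated similarity scores for the next part of the query.
- Selection logic: applies several filter conditions to refine the matches. Together they take into account full-address similarity, postcode similarity, and address prefixes, so that only meaningful relationships are created.
- MERGE (a)-[:SIMILAR_ADDRESS {…}]->(b);: finally, creates a SIMILAR_ADDRESS relationship between a and b if they satisfy the conditions, and stores the calculated similarity scores as properties on that relationship for later use.
Combining Levenshtein text similarity with this layered selection logic lets the query capture subtle relationships between addresses and persist them in the graph.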
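Once the relationships exist, the stored scores can be read back without recomputing anything. A minimal sketch of such a review query, using only the relationship type and property names created above:

```cypher
// Review candidate duplicates, most similar first
MATCH (a:Address)-[r:SIMILAR_ADDRESS]->(b:Address)
RETURN a.FullAddress AS a_full_address,
       b.FullAddress AS b_full_address,
       r.full_address_similarity AS full_address_similarity,
       r.postcode_similarity AS postcode_similarity
ORDER BY full_address_similarity DESC;
```

Sorting by the full-address score puts the strongest candidates at the top, which is a convenient order for manual review before any nodes are merged.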