精华 neo4j-load-csv
发布于 7 年前 作者 wkq278276130 4686 次浏览 最后一次编辑是 5 年前 来自 分享

Neo4j load csv 使用

本文转自 http://weikeqin.cn/2017/04/11/neo4j-load-csv/

LOAD CSV非常适合导入中小型数据,例如最高10M的记录范围。 对于大数据集,即在100B记录范围内,我们可以访问专门的批量导入程序。

官方实例

LOAD CSV FROM "http://data.neo4j.com/examples/person.csv" AS line
fieldterminator '\t'
MERGE (n:Person {id: toInt(line[0])})
SET n.name = line[1]
RETURN n

(1) 导入前注意事项

  1. csv一定要用 UTF-8无BOM编码,否则入库时可能乱码
  2. 默认使用;作为分隔符
  3. 导入数据前,一定要校验数据,都是坑呀,总结出的经验
  4. 建议使用绝对路径

// 使用\t作为分隔符 fieldterminator ‘\t’

(2) 检查csv一共多少行

(2.1) 相对路径格式

load csv from "file:/Test.csv" as line return count(*);

(2.2) windows下绝对路径格式

load csv from "file:///C:/User/wdb/2017-04-06_test.csv" as line return count(*);

(2.3) linux下绝对路径格式

load csv from "file:/home/usdp/databases/data/2017-04-06_test.csv" as line return count(*);

(3) 检查数据

(3.1) 检查数据 不带标题 注意:注意路径 使用UTF-8编码,否则可能导致乱码

load csv from "file:///C:/User/wdb/Test.csv"  as line  with line return line limit 5;

(3.2) 检查数据 带标题 注意:注意路径 使用UTF-8编码,否则可能导致乱码

load csv with headers from "file:///C:/User/wdb/Test.csv" as line with line return line limit 5;

(4) 真正导入数据

// 第一次字段用汉字,发现很慢

load csv with headers from "file:/Test.csv" as line with line create (:Person {姓名:line.姓名, 序号:toInt(line.序号), 联系方式:toInt(line.联系方式), 性别:line.性别, 班级:line.班级}) ;

// 第二次 优化后 没有用批量提交,也有点慢

load csv with headers from "file:/Test.csv" as line with line create (:Person {name:line.姓名, number:toInt(line.序号), tel:toInt(line.联系方式), sex:line.性别, class:line.班级}) ;

// 第三次 再次优化后 感觉比前两次快多了

using periodic commit 1000 load csv with headers from "file:///C:/User/wdb/Test.csv" as line with line create (:Person {name:line.姓名, number:toInt(line.序号), tel:toInt(line.联系方式), sex:line.性别, class:line.班级}) ;

(5) 注事事项

(5.1) 批量提交

数据量大了以后可以使用批量提交

using periodic commit 10000 表示 每10000行进行一次事务提交

USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (pet:Pet {petId: row.PetId})
MERGE (owner:Owner {ownerId: row.OwnerId})
 ON CREATE SET owner.name = row.OwnerName
MERGE (pet)-[r:OWNED_BY]->(owner)

(5.2) 处理空值

处理空值

//skip null values
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
WITH row WHERE row.Company IS NOT NULL
MERGE (c:Company {companyId: row.Id})

//set default for null values
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (c:Company {companyId: row.Id, hqLocation: coalesce(row.Location, "Unknown")})

//change empty strings to null values (not stored)
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (c:Company {companyId: row.Id})
SET c.emailAddress = CASE trim(row.Email) WHEN "" THEN null ELSE row.Email END

(5.3) 使用split()

//split string of employee skills into separate nodes
LOAD CSV FROM 'file:///data.csv' AS row
MERGE (e:Employee {employeeId: row.Id})
UNWIND split(row.skills, ',') AS skill
MERGE (s:Skill {name: skill})
MERGE (e)-[r:HAS_EXPERIENCE]->(s);

(5.4) 使用CASE

//set businessType property based on shortened value in CSV
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
WITH row,
(CASE row.BusinessType
 WHEN 'P' THEN 'Public'
 WHEN 'R' THEN 'Private'
 WHEN 'G' THEN 'Government'
 ELSE 'Other' END) AS type
MERGE (c:Company {companyId: row.CompanyId})
SET c.businessType = type
RETURN *

(5.5) MERGE relation

为获得最佳性能,请始终在具有索引主键属性的单个标签上使用MATCH和MERGE。 您还应该将节点和关系创建分离为单独的语句。

以下两种写法都没问题,第一种相对第二种更占内存,第二种在内存有限的情况下性能会好一点

MERGE (e:Employee {employeeId: row.employeeId})
MERGE (c:Company {companyId: row.companyId})
MERGE (e)-[r:WORKS_FOR]->(c)

官方推荐在具体的Label的属性上建立索引使用MATCH和MERGE,并且把创建节点和创建关系分开写

LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (e:Employee {employeeId: row.employeeId})
RETURN count(e);

LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (c:Company {companyId: row.companyId})
RETURN count(c);

LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MATCH (e:Employee {employeeId: row.employeeId})
MATCH (c:Company {companyId: row.companyId})
MERGE (e)-[r:WORKS_FOR]->(c)
RETURN count(*);

官方原文

To improve inserting or updating unique entities into your graph (using MERGE or MATCH with updates), you can create indexes and constraints declared for each of the labels and properties you plan to merge or match on.

For best performance, always MATCH and MERGE on a single label with the indexed primary-key property. You should also separate node and relationship creation into separate statements.

This way, the load is only doing one piece of the import at a time and can move through large amounts of data quickly and efficiently, reducing heavy processing.

When the amount of data being loaded is too much to fit into memory, there are a couple of different approaches you can use to combat running out of memory during the data load.

本文转自 http://weikeqin.cn/2017/04/11/neo4j-load-csv/

回到顶部