Neo4j数据库与图数据挖掘算法结合
一. 环境准备
-
neo4j python包:
pip3 install neo4j
和pip3 install py2neo
(这里的py2neo 是python对Neo4j的驱动库,同时这里必须是py2neo版本必须是最新版4,不然会报连接数据库的错误,老版本不兼容的问题) -
Java8:这里由于neo4j 数据库是依赖于java8的。
-
Neo4j_3.5.14:这里由于neo4j 在中国地区下载慢,并且neo4j3.X版本才支持java8,到4.0版本就是需要java11了。
-
Neo4j_Desktop:neo4j的桌面端(可以远程数据库和连接本地数据库,同时包含很多额外的扩展)
二. 连接本地图数据库
py2neo V4 官方文档:https://py2neo.org/v4/index.html
Neo4j 一共有3种连接方式:
- Bolt:bolt://localhost:11005
- HTTP:http://localhost:11006
- HTTPS:https://localhost:11007
这里可以通过Neo4j Desktop来查看新建的图数据库(同时设置密码)
2.1 Neo4j数据库语法Cypher
- 创建
create (:Movie {title:"ABC",released:2016}) return p;
- 查询
match (p: Person) return p; 查询Person类型的所有数据
match (p: Person {name:"sun"}) return p; 查询名字等于sun的人
match( p1: Person {name:"sun"} )-[rel:friend]->(p2) return p2.name , p2.age 查询sun的朋友的名字和年龄
match (old) ... create (new) create (old)-[rel:dr]->(new) return new 对已经存在的节点和新建的节点建立关系
- 更新
MERGE (m:Movie { title:"Cloud Atlas" }) ON CREATE SET m.released = 2012 RETURN m
- 筛选过滤
match (p1: Person)-[r:friend]->(p2: Person) where p1.name=~"K.+" or p2.age=24 or "neo" in r.rels return p1,r,p2
- 聚合函数(支持count,sum,avg,min,max)
MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)<-[:DIRECTED]-(director:Person) RETURN actor,director,count(*) AS collaborations
- 排序和分页
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
RETURN a,count(*) AS appearances
ORDER BY appearances DESC SKIP 3 LIMIT 10;
2.2 图数据库的基本操作py
- 这里是通过导入py2neo这个neo4j的第三方库来连接
from py2neo import Graph,Node
graph = Graph(
"http://localhost:11006",
username="neo4j",
password="yy"
)
- 清空数据库
graph.delete_all()
- 定义节点关系
a = Node('Person', name='Alice')
b = Node('Person', name='Bob')
r = Relationship(a, 'KNOWNS', b)
s = a | b | r
graph.create(s)
- Node查询
# 用CQL进行查询,返回的结果是list
data1 = graph.data('MATCH(p:PersonTest) return p')
print("data1 = ", data1, type(data1))
# 用find_one()方法进行node查找,返回的是查找node的第一个node
data2 = graph.find_one(label='PersonTest', property_key='name', property_value="李四")
print ("data2 = ", data2, type(data2))
# 用find()方法进行node查找,需要遍历输出,类似于mongodb
data3 = graph.find(label='PersonTest')
for data in data3:
print ("data3 = ", data)
- 关系查询
relationship = graph.match_one(rel_type='KNOWNS')
print (relationship, type(relationship))
- 更新push
node1 = graph.find_one(label='PersonTest', property_key='name', property_value="张三")
node1['age'] = 21
graph.push(node1)
data4 = graph.find(label='PersonTest')
for data in data4:
print ("data4 = ", data)
#基于上面的操作,再次定义node1[‘age’] = 99,并执行graph.push(node1),发现已经更新
node1['age'] = 99
graph.push(node1)
data5 = graph.find(label='PersonTest')
for data in data5:
print ("data5 = ", data)
- 删除Node和Relationship
node = graph.find_one(label='PersonTest', property_key='name', property_value="李四")
relationship = graph.match_one(rel_type='KNOWNS')
graph.delete(relationship)
graph.delete(node)
data6 = graph.find(label='PersonTest')
for data in data6:
print ("data6 = ", data)
- 多条件查询
a = Node('PersonTest', name='张三', age=21, location='广州')
b = Node('PersonTest', name='李四', age=22, location='上海')
c = Node('PersonTest', name='王五', age=21, location='北京')
r1 = Relationship(a, 'KNOWS', b)
r2 = Relationship(b, 'KNOWS', c)
s = a | b | c | r1 | r2
graph.create(s)
data7 = graph.find(label='PersonTest')
for data in data7:
print ("data7 = ", data)
- 单条件查询
# 单条件查询,返回的是多个结果
selector = NodeSelector(graph)
persons = selector.select('PersonTest', age=21)
print("data8 = ", list(persons))
- 多条件查询
selector = NodeSelector(graph)
persons = selector.select('PersonTest', age=21, location='广州')
print("data9 = ", list(persons))
- 复杂查询orderby
# orderby进行更复杂的查询
selector = NodeSelector(graph)
persons = selector.select('PersonTest').order_by('_.age')
for data in persons:
print ("data10 = ", data)
三. 中心性算法实验(社区算法)
3.1 中心性算法
- 度中心性:度中心性是最简单度量,即为某个节点在网络中的联结数。
MATCH (c:Character)-[:INTERACTS]-() RETURN c.name AS character, count(*) AS degree ORDER BY degree DESC
- 加权度中心性:指的是每个节点的权重后的中心性
MATCH (c:Character)-[r:INTERACTS]-() RETURN c.name AS character, sum(r.weight) AS weightedDegree ORDER BY weightedDegree DESC
- 介数中心性:在网络中,一个节点的介数中心性是指其它两个节点的所有最短路径都经过这个节点,则这些所有最短路径数即为此节点的介数中心性。
MATCH (c:Character)
WITH collect(c) AS characters
CALL apoc.algo.betweenness(['INTERACTS'], characters, 'BOTH') YIELD node, score
SET node.betweenness = score
RETURN node.name AS name, score ORDER BY score DESC
- 紧密度中心性:指到网络中所有其他角色的平均距离的倒数。
MATCH (c:Character) WITH collect(c) AS characters CALL apoc.algo.closeness(['INTERACTS'], characters, 'BOTH') YIELD node, score RETURN node.name AS name, score ORDER BY score DESC
3.2 PageRank 算法
PageRank算法源自Google的网页排名。它是一种特征向量中心性(eigenvector centrality)算法。
UNWIND {nodes} AS n
MATCH (c:Character) WHERE c.name = n.name
SET c.pagerank = n.pg
可以在Neo4j的图中查询最高PageRank值的节点:
MATCH (n:Character)
RETURN n.name AS name, n.pagerank AS pagerank ORDER BY pagerank DESC LIMIT 10