Graph mining
Graph mining is a process in which the mining techniques are used in finding a pattern or relationship in the given real-world collection of graphs. By mining the graph, frequent substructures and relationships can be identified which helps in clustering the graph sets, finding a relationship between graph sets, or discriminating or characterizing graphs. Predicting these patterning trends can help in building models for the enhancement of any application that is used in real-time. To implement the process of graph mining, one must learn to mine frequent subgraphs.
Frequent Subgraph Mining
Let us consider a graph h with an edge set E(h) and a vertex set V(h). Let us consider the existence of subgraph isomorphism from h to h’ in such a way that h is a subgraph of h’. A label function is a function that plots either the edges or vertices to a label. Let us consider a labeled graph dataset, Let us consider s(h) as the support which means the percentage of graphs in F where h is a subgraph. A frequent graph has support that will be no less than the minimum support threshold. Let us denote it as min_support.
Steps in finding frequent subgraphs:
There are two steps in finding frequent subgraphs.
- The first step is to create frequent substructure candidates.
- The second step is to find the support of each and every candidate. We must optimize and enhance the first step because the second step is an NP-completed set where the computational complexity is accurate and high.
There are two methods for frequent substructure mining.
The Apriori-based approach: The approach to find the frequent graphs begin from the graph with a small size. The approach advances in a bottom-up way by creating candidates with extra vertex or edge. This algorithm is called an Apriori Graph. Let us consider Qk as the frequent sub-structure set with a size of k. This approach acquires a level-wise mining technique. Before the Apriori Graph technique, the generation of candidates must be done. This is done by combining two same but slightly varied frequent subgraphs. After the formation of new substructures, the frequency of the graph is checked. Out of that, the graphs found frequently are used to create the next candidate. This step to generate frequent substructure candidates is a complex step. But, when it comes to generating candidates in itemset, it is easy and effortless. Let’s consider an example of having two itemsets of size three such that and So, the itemset derived using join would be pqrs. But when it comes to substructures, there is more than one method to join two substructures.
Algorithm:
This approach is based on frequent substructure mining.
Input: F= a graph data set. min_support= minimum support threshold Output: Q1,Q2,Q3,....QK, a frequent substructure set graphs with the size range from 1 to k. Q1 <- all the frequent 1 subgraphs in F; k <- 2; while Qk-1 ≠ ∅ do Qk <- ∅; Gk <- candidate_generation(Qk-1); foreach candidate l ∈ Gk do l.count <- 0; foreach Fi ∈ F do if isomerism_subgraph(l,Fi) then l.count <- l.count+1; end end if l.count ≥ min_support(F) ∧ l∉Qk then Qk = Qk U l; end end k <- k+1; end
It is an iterative method in which the first candidate generation takes place followed by the support computation. The subgraphs are generated using subgraph isomorphism. Thus frequent subgraphs are generated by efficiently using this approach which helps in FSM. Apriori approach uses BFS(Breadth-First Search) due to the iterative level-wise generation of candidates. This is necessary because if you want to mine the k+1 graph, you should have already mined till k subgraphs.
The Pattern- growth approach: This pattern-growth approach can use both BFS and DFS(Depth First Search). DFS is preferred for this approach due to its less memory consumption nature. Let us consider a graph h. A new graph can be formed by adding an edge e. The edge can introduce a vertex but it is not a need. If it introduces a vertex, it can be done in two ways, forward and backward. The Pattern-growth graph is easy but it is not that efficient. Because there is a possibility of creating a similar graph that is already created which leads to computation inefficiency. The duplicate graphs generated can be removed but it increases the time and work. To avoid the creation of duplicate graphs, the frequent graphs should be introduced very carefully and conservatively which calls the need for other algorithms.
Algorithm:
The below algorithm is a pattern-growth-based frequent substructure mining with a simplistic approach. If you need to search without duplicating, you must go with a different algorithm with gSpan.
Input: q= a frequent graph F= a graph data set. min_support= minimum support threshold Output: P = the frequent graph set P <- ∅; Call patterngrow_graph(q, F, min_support, P); procedure patterngrow_graph(q, F, min_support, P) if q ∈ P then return; else insert q into P; scan F once, find all the edges e such that q can be extended to q -> e; for each frequent q -> e do PatternGrowthGraph(q -> e, D, min_support, P); return;
An edge e is used to extend a new graph from the old one q. The newly extend graph is denoted as e. " title="Rendered by QuickLaTeX.com" height="20" width="85" style="vertical-align: -5px;"> The extension can either be backward or forward.
Data Mining Graphs and Networks
Data mining is the process of collecting and processing data from a heap of unprocessed data. When the patterns are established, various relationships between the datasets can be identified and they can be presented in a summarized format which helps in statistical analysis in various industries. Among the other data structures, the graph is widely used in modeling advanced structures and patterns. In data mining, the graph is used to find subgraph patterns for discrimination, classification, clustering of data, etc. The graph is used in network analysis. By linking the various nodes, graphs form network-like communications, web and computer networks, social networks, etc. In multi-relational data mining, graphs or networks is used because of the varied interconnected relationship between the datasets in a relational database.
Contact Us