How to Generate Experimental Results in Our Publication
We assume that you already downloaded and compiled the RelaxMap as in here.How to get Datasets
The benchmark datasets generated by a benchmark network generator are in data/ directory of the source code repository.
Those benchmark datasets are stored as in compressed format, so you should uncompress them before use them.
In data/ directory of the RelaxMap source code directory, you can run simple tar command to extract those datasets like following:
$ tar xvf directNet_1k_sm_mu_network.tgz
You can extract other datasets in data/ directory as by replacing other .tgz files instead of directNet_1k_sm_mu_network.tgz.
Note that *_network.tgz files contain benchmark network files and *_known_community.tgz files contain corresponding ground truth community files.
Also, you can find the real-world datasets used in our paper from the SNAP Network Data Collection.
If you are in the RelaxMap source code directory, you can download real-world graph datasets from the SNAP Network data repository by using following commands:
[For Mac OS X machine (which doesn't have 'wget' command)]
$ curl http://snap.stanford.edu/data/web-Stanford.txt.gz -o data/web-Stanford.txt.gz
[For Linux machine]
$ gunzip data/web-Stanford.txt.gz
$ wget http://snap.stanford.edu/data/web-Stanford.txt.gz
We use total 6 real-world datasets from the SNAP Network Data Collection,
and you could download other network datasets by replacing [web-Stanford.txt] with
appropriate file names, such as web-NotreDame.txt, web-BerkStan.txt, soc-LiveJournal1.txt, and so on.
$ gunzip web-Stanford.txt.gz
$ mv web-Stanford.txt data/
Command to run our (prioritized) RelaxMap experiments
[Usage]: $ ./ompRelaxmap [seed] [network data] [# threads] [# attempts] [threshold] [vThresh] [maxIter] [outDir] [prior/normal]
example: $ ./ompRelaxmap 1 data/web-Stanford.txt 4 1 1e-3 0.0 10 outputs/ prior
args[0]: seed - we used 1 ~ 100 as random seed values for benchmark datasets,
and 1 ~ 10 for real-world datasets.
args[1]: network data file
args[2]: # thread - we used 1, 2, 4, 8, and 12 for two different machines.
args[3] - [6]: we used fixed values for these arguments as shown above.
args[7]: output directory location.
args[8]: we set the flag prior for prioritized search which is more efficient,
and normal for non-prioritized results.
The example command shows how to run the prioritized RelaxMap with 4-way parallel execution. If you want to run non-prioritized RelaxMap, you should change the last argument from prior to normal. Note that the elapsed time means time used for overall graph partitioning (printed as "Time for partitioning").
Command to run Infomap
Our paper compared the output quality of the proposed (prioritized) RelaxMap algorithm to
that of the original sequential community detection algorithm with the map equation, named Infomap.
You can download the latest Infomap source code from here.
You can easily compile Infomap from their instruction. After compiling Infomap, you can run the experiment we did by following command:
$ mkdir outputs
Here [data_dir] should be the data directory where you put the graph data,
and "web-Stanford.txt" dataset is one of the dataset you could download
from the SNAP Network Data Collection, as shown above.
The option '-s' is the random seed, so if you want to run multiple times with different random seed
you can use different integers with the option.
$ ./Infomap [data_dir]/web-Stanford.txt outputs/ -2 -d -s 1 -m 1e-3 -v -o -e
Some Experimental Results
You can check out the details of the experimental results from our ICDMW 2013 paper.
Here, we present briefly the experimental results.
Quality Comparison [Benchmark Datasets]
Calculating normalized mutual information (NMI)
The normalized mutual information (NMI) is used for comparing two given community structures of the same dataset. If NMI is equal to 1, then the given two community structures are identical, and if NMI is equal to 0, then the given two community structures are totally independent. You can find how to calculate NMI from Section III of this paper.
If you run our algorithm to find a community structure of a benchmark dataset,
you can calculate NMI by using a simple perl-script, called calculate_normalized_mutual_info.pl.
If you install our RelaxMap application correctly, you have some bencmark datasets in data/ directory as in .tgz format.
You can extract each dataset file as shown "How to get Datasets" section.
We assume that you are in the directory of RelaxMap implementation and put the perl script under that directory.
Now you can run our algorithm and calculate NMI for the output as below:
$ ./ompRelaxmap 1 data/directNet_5k_sm_mu_07.txt 4 1 1e-3 0.0 10 outputs/ prior
$ ./calculate_normalized_mutual_info.pl -c data/directNet_5k_sm_mu_07_community.dat -f outputs/directNet_5k_sm_mu_07.clu -n 5000
Note that the output of the RelaxMap will be stored in "outputs/" directory as we put argument [7], with the file name as same as data file name with extension .clu (clustering info) and .tree (tree structure info). So, if you want to run multiple times and keep all the outputs of the same dataset separately, then you should move the outputs with different name. Otherwise, only outputs of the last experiment of the dataset will be remained.
Here, we show an example of how to calculate NMI of an output with a benchmark dataset (directNet_5k_sm_mu_07.txt). For the given perl-script, we should provide the original ground truth community structure (with '-c' option), the clustering output of the algorithm (with '-f' option), and the number of points, here 5000.
A graph for normalized mutual information of outputs of RelaxMap and Infomap
The above graph shows the average NMI of 100 outputs as a function of the mixing parameter comparing Infomap and RelaxMap for the directNet-5k benchmark dataset. Since the ground truth community structure is given for each benchmark dataset, we measured NMI by comparing the outputs of the Infomap and RelaxMap algorithms with the corresponding known ground truth community structures. This graph implies that the RelaxMap generates outputs that are very similar to the outputs of the Infomap algorithm.
Quality Comparison [Real-World Datasets]
The output quality comparison between the Infomap and RelaxMap algorithm (1-way and 8-way parallelism) with soc-LiveJournal1 datasets from SNAP data repository. In this plot, the x-axis represents three different algorithms and the y-axis is the MDL score. The output qualities of the RelaxMap-1 and RelaxMap-8 are similar to the output quality of the Infomap.
Shared-Memory Parallel Efficiency
Parallel efficiency is defined as below:
efficiency = T1 / (Tp * p)
where T1 = time for sequential run, Tp = time for p-way parallel run,
and p = the number of parallel units.
Parallel efficiency of the RelaxMap on a 12-core machine, two six-core Intel Xeon E5-2430L processors (2.00 GHz, 15 MB Intel Smart Cache) and 64 GB of main memory, with 6 real world datasets. We measure the parallel efficiency based on average times for partitioning of 10 runs.