Open Source SW Utilization

Part 2.

Graph DBMS for intelligent processing of big graphs
(Development of graph DBMS technology for intelligent, high-speed processing of extremely large graphs)
About S62 Project
A distributed graph DBMS designed for intelligent processing of big graphs can raise the competitiveness of big-data and AI-based businesses. The system processes important graph applications by supporting efficient storage, analysis, and machine learning of large graphs on cloud-based distributed machines.

At the core of graph applications are three essential queries: graph pattern mining, graph analytics, and graph learning.
Nonetheless, current graph DBMSs struggle to handle these queries for the following reasons:
1) Query processing is inefficient because they do not use native graph storage.
2) Scalability is limited by their reliance on in-memory processing.
3) There is no integrated pipeline for graph applications that combines the three query types above.


Our goal is to develop a next-generation graph DBMS that can support graph applications mixing the three key graph queries within a single system, ensuring scalable and fast query processing for big graphs. The main components of the system are: 1) big/dynamic native property graph storage, 2) a distributed, disk-based, scalable, and efficient graph query engine, 3) a distributed graph neural network training and inference engine, 4) an integrated API library, and 5) an interactive query service.

S62 is an open source project with main contributions from DSLAB at POSTECH, CUBRID, DKELab at Kyung Hee University, DKELab at Kangwon University, and KOSSA.
Features of S62 Project
  • 01
    Flexible yet Fast Schemaless Computation

    Support schemaless property graph model and openCypher query language

    Graph-native, structure-aware columnar storage

    Fast bulk loading using modern hardware capabilities such as SIMD

    Vectorized, schemaless query execution engine

    Top-down openCypher query optimizer based on the Cascades optimization framework

    Efficiently supports both graph and relational workloads

  • 02
    Fast Distributed Graph Neural Network System

    Efficient learning based on Intermediate Data Caching

    Task-separated multi-process learning scheduler

    Fast, low-overhead GDLL sampler that maximizes GPU utilization

    Effective feature caching for distributed environment

    Learning engine tightly connected to GDBMS

  • 03
    Elastic Tools for GDBMS utility

    Supports graphical analysis of GDBMS

    Supports various data migration methods between RDBMS and GDBMS, online or offline

    Accelerated data migration based on multi-threading

    Supports automatic transformation of relational data model to graph data model

S62 Open Source User Guide
S62 is a graph DBMS that can support graph applications mixed with three key graph queries within a single system, ensuring scalable and fast query processing for big graphs. S62 takes openCypher query as input. S62 supports a general property graph model, with a particular focus on the schemaless feature of the model.
For instance, vertices with the same label don't necessarily have the same properties. The most straightforward way to support schemaless data is to embed the schema into each tuple, but this slows down query processing because the data can no longer be stored in a structured (e.g., columnar) format. S62 supports schemaless data well while utilizing structure-aware storage to serve both graph and relational workloads efficiently.
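To make the idea concrete, here is a minimal sketch (not S62's actual storage code; the label, properties, and data are illustrative) of grouping schemaless vertices by their property-key signature, so that each group is fully structured and could be laid out in columnar form:

```python
from collections import defaultdict

# Hypothetical "Person" vertices: same label, different property sets.
vertices = [
    {"name": "Ana", "age": 31},
    {"name": "Bo"},                               # no "age" property
    {"name": "Cy", "age": 28, "city": "Pohang"},
    {"name": "Di", "age": 45},
]

def group_by_signature(vertices):
    """Bucket vertices by their sorted property-key set so each bucket
    is fully structured and could be stored in columnar format."""
    groups = defaultdict(list)
    for v in vertices:
        signature = tuple(sorted(v.keys()))
        groups[signature].append(v)
    return dict(groups)

groups = group_by_signature(vertices)
# Vertices sharing the keys ("age", "name") can share one columnar chunk.
```

This is only the grouping intuition; the actual storage additionally handles dynamic updates and query execution over these chunks.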
Getting Started

We provide scripts and Docker files to help developers set up and install S62.

Get the S62 Source
Docker Setup

We provide a script for docker installation to avoid installing system dependencies required to build S62. Use the following commands:

cd docker # move to docker directory
docker build . -t s62-image # build docker image

# run 'docker images' to check whether 's62-image' has been created
# change docker image & container name in the 'run-docker-example.sh' script
./run-docker-example.sh <directory where the database will be located> <input data folder> # create docker container
Build S62

The project consists of several modules using various platforms. We mainly use CMake as the project builder and Ninja for fast incremental compilation.

mkdir build && cd build
cmake -GNinja -DCMAKE_BUILD_TYPE=Release ..
# to compile the project in debug mode, use 'Debug' instead of 'Release'
ninja -j [NUM_PARALLEL_BUILD_THREADS]

You can see the system components and their dependencies using the following command.

cmake -GNinja -DCMAKE_BUILD_TYPE=Debug --graphviz=graphviz/dependencies.dot ..

To build a specific target:

ninja [TARGET_NAME]
Project directories

The list below explains the directories in the project root.

github/ : Github automation

bin/ : binary files and shell scripts for executing the project

build/ : build configuration files

conf/ : system configuration file templates

data/ : data files for running toy examples

docker/ : docker related files

docs/ : design documents and user documents

examples/ : getting started examples

k8s/ : kubernetes files

licenses/ : license files

tools/ : useful tools

test/ : test cases for system tests

tbgpp-*/ : source codes and test cases of each module

src : source code (.cxx) for our native implementations (native namespace)

include : headers (.hxx) for our native implementations (native namespace)

libABC : heavily modified external libraries we depend on

third_party : directory for barely modified external libraries we depend on

S62 Open Source User Guide - MiT
GDBMS Migration Toolkit for S62

MiT (Migration Toolkit) is a tool that automatically transforms relational DBs into graph DB models and then converts the relational data into vertices and edges suitable for a graph DBMS, based on the transformed model.
MiT was developed based on the CUBRID Migration Toolkit, version 11.0.0.0002 (https://github.com/cubrid/cubrid-migration).

The additional features of MiT include:

Connection and testing of Graph DBMS

Automatic model change and modification based on metadata

Various data migration methods based on the transformed model (online, Cypher file, CSV file)

Migration report for data migration
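The model transformation at the heart of MiT can be pictured roughly as follows. This is a simplified sketch, not MiT's actual code; the table name, columns, and foreign-key map are illustrative assumptions. Each row becomes a vertex keyed by its primary key, and each foreign-key value becomes an edge to the referenced row:

```python
# Sketch of relational-to-graph transformation: rows -> vertices,
# foreign-key references -> edges. Names below are illustrative only.

def rows_to_graph(table, rows, pk, fks):
    """fks maps a foreign-key column to its target (table, pk column)."""
    vertices, edges = [], []
    for row in rows:
        vid = (table, row[pk])
        props = {k: v for k, v in row.items() if k not in fks}
        vertices.append({"id": vid, "label": table, "props": props})
        for fk_col, (ref_table, _ref_pk) in fks.items():
            if row.get(fk_col) is not None:
                edges.append({"src": vid,
                              "dst": (ref_table, row[fk_col]),
                              "label": f"{table}_{fk_col}"})
    return vertices, edges

# Toy "orders" table referencing a "customers" table.
orders = [{"id": 1, "item": "book", "customer_id": 7},
          {"id": 2, "item": "pen", "customer_id": None}]
v, e = rows_to_graph("orders", orders, pk="id",
                     fks={"customer_id": ("customers", "id")})
```

MiT additionally derives the target model from the source metadata and lets the user adjust it before migrating, which this sketch omits.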

Getting started

MiT can be built in a Linux environment by downloading the source code and using the build script.

Refer to the user manual (https://hwany7seo.github.io/mit_manual/start.html) for instructions on how to use MiT.

Download Source Code
git clone https://github.com/postech-dblab-iitp/migration-toolkit.git
Program Build Requirements

Building MiT is currently supported only in a Linux environment; CentOS 7 is recommended.

Since the Java and Eclipse versions required for the build are included in com.cubrid.cubridmigration.build, no additional environment setup is needed. The versions are:
- Java 1.7
- Eclipse IDE Helios R

How to Execute Build
cd com.cubrid.cubridmigration.build
sh build.sh
License

MiT is subject to the same license as CUBRID CMT.

Apache license 2.0

Getting Help

http://jira.iitp.cubrid.org/secure/Dashboard.jspa

You can get support by filing bugs, improvements, or questions on the Jira dashboard above.


MiT Project Directory

LICENSE/ : A directory containing txt files of licenses for libraries, frameworks, etc., used in MiT

MiT_Manual/ : A directory where the manual on how to use MiT is written in rst files

MiT_docs/ : A directory containing documents related to the design of MiT

com.cubrid.common.configuration/ : Manages the execution, termination, class loading, etc.

com.cubrid.common.update.feature/ : Information about libraries and plugins to be updated

com.cubrid.common.update/ : Manages updates and update checks for MiT

com.cubrid.cubridmigration.app.feature/ : Configures the app's features

com.cubrid.cubridmigration.app.update.site/ : Stores URLs for fetching web information displayed upon MiT execution

com.cubrid.cubridmigration.app/ : The first application screen displayed when the program is run

com.cubrid.cubridmigration.build/ : Information and shell scripts for building the program

com.cubrid.cubridmigration.command/ : Project responsible for script migration

com.cubrid.cubridmigration.core.testfragment/ : Test codes for the core project

com.cubrid.cubridmigration.core/ : Handles the core features of MiT, such as migration and page navigation

com.cubrid.cubridmigration.plugin.feature/ : Configures the features of the plugin project

com.cubrid.cubridmigration.plugin.update.site/ : Saves URLs to connect to during plugin updates

com.cubrid.cubridmigration.plugin/ : Project for setting up the MiT plugins

com.cubrid.cubridmigration.ui.testfragment/ : Test codes for the UI project

com.cubrid.cubridmigration.ui/ : Project responsible for the UI of MiT

S62 Open Source User Guide - ViT
GDBMS Visual-Tool for S62

ViT (Visual Tool) is a program that allows you to query a graph DBMS in an interactive environment and visually express and analyze the results.
ViT was developed based on DBeaver v21.2.2 (https://github.com/dbeaver/dbeaver), an open source tool that supports various DBMSs, using additional open source libraries such as Gephi to visualize the results of a graph DBMS.

The features added for graph DBMS are as follows.

Connection part: Graph DBMS connection and connection testing

Navigation part: Displays vertex and edge information of connected graph DB

Query window: Create and request queries of connected graph DB

Visualization window

Visualize query results as a graph (displaying vertices, edges, and labels)

Graph editing (moving, highlighting, changing properties, etc.)

Mini map

Change layout

Analysis functions (shortest path, etc.)
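To illustrate the kind of analysis listed above, here is a minimal shortest-path sketch over an unweighted graph using breadth-first search. This is an illustrative stand-in, not ViT's implementation, and the toy graph is an assumption:

```python
from collections import deque

def shortest_path(adj, src, dst):
    """Return a shortest vertex path from src to dst in an unweighted
    graph given as an adjacency dict, or None if dst is unreachable."""
    prev = {src: None}          # visited set + predecessor links
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:            # reconstruct the path by walking back
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj.get(u, []):
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return None

# Toy directed graph for illustration.
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
```

In ViT the analysis runs over the query result currently shown in the visualization window and highlights the found path on the rendered graph.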


Development-related documents, etc. are managed in the ViT_docs folder.

Getting Started

ViT can be built through a build script after downloading the source code in a Linux environment.
For information on how to use ViT, please refer to the user manual (https://hwany7seo.github.io/vit_manual/start.html).

Download Source Code
git clone https://github.com/postech-dblab-iitp/visual-tool.git
Program Build Requirements

Building ViT is currently supported only in a Linux environment. The programs required for the build are as follows:
- JDK 11
- Apache Maven 3.8.6+
- Git
- Internet Connection

How to Execute Build
git clone https://github.com/postech-dblab-iitp/visual-tool.git
cd visual-tool
sh build.sh
License

Apache license 2.0

Getting Help

http://jira.iitp.cubrid.org/secure/Dashboard.jspa

You can get support by filing bugs, improvements, or questions on the Jira dashboard above.

ViT Project Directory

ViT_Manual/: ViT manual in rst format

ViT_docs/ : ViT design documents

bundles/ : basic plugins

docs/ : Original DBeaver documentation

features/: Used to structure the program’s plugins and dependencies

gephi-toolkit/ : Visualization library used to display graphs

plugins/: Original source, see DBeaver wiki for details (https://github.com/dbeaver/dbeaver/wiki/Develop-in-Eclipse)

product/: final program settings

test/ : Original DBeaver test code

S62 Open Source User Guide - GNN
Feature Caching on GPU for Expediting GNN Training

Most graph data carries many node features. Graph neural network models are typically trained on GPUs for faster computation, and the input features they need are stored in CPU memory; loading them onto the GPU during model computation is time-consuming. Feature Caching on GPU caches the required features on the GPU and reuses them during training instead of repeatedly loading them from the CPU, minimizing training time.
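The caching idea can be sketched as follows, with plain Python dictionaries standing in for GPU and CPU memory. This illustrates the concept only; it is not the project's code, and the capacity, features, and access pattern are assumptions:

```python
# Sketch of GPU feature caching: frequently accessed node features stay
# in a fixed-size cache ("GPU"); misses fall back to "CPU" storage.

class FeatureCache:
    def __init__(self, cpu_features, capacity):
        self.cpu = cpu_features      # node id -> feature vector (CPU side)
        self.capacity = capacity     # how many features fit on the "GPU"
        self.gpu = {}                # cached features (GPU side)
        self.hits = self.misses = 0

    def get(self, node):
        if node in self.gpu:
            self.hits += 1           # served from GPU, no transfer needed
            return self.gpu[node]
        self.misses += 1             # simulate a CPU -> GPU transfer
        feat = self.cpu[node]
        if len(self.gpu) < self.capacity:
            self.gpu[node] = feat    # cache while space remains
        return feat

cpu_feats = {i: [float(i)] * 4 for i in range(100)}
cache = FeatureCache(cpu_feats, capacity=10)
for node in [1, 2, 1, 3, 1]:         # repeated access to hot node 1
    cache.get(node)
```

A real implementation would use GPU tensors and an eviction policy favoring high-degree (frequently sampled) nodes; the hit/miss accounting above is where the training-time saving comes from.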

Main Function

1. Provide the target graph dataset with its features in the specified file and run the Python program for GNN model training.

2. The program caches the features on the GPU, expediting GNN training.

How to Use

1. Install a Python interpreter.

2. Install the prerequisite libraries for running the program.

3. Place the target graph dataset in the desired folder.

4. Run the program from the command line.

5. Obtain the expedited GNN training result.

Layered Dependent Importance sampling and Sample-based GNN Training over GPUs

The GPU Accelerator speeds up GNN training by using GPUs both to generate layered dependent samples and to train the GNN.
It consists of two modules: a GPU sampler and a pipeline for GNN training.

Layered Dependent Importance Sampler

Hybrid CPU-GPU training Pipeline

Multi-processing for CPU and GPU training

Multi-threaded gradient calculation, gradient accumulation, and model updates

Main Function

GPU_Accelerator: The main script that provides the CPU-GPU training pipeline for GNN.

Sample_Generator: Generates samples from the graph for input nodes.

Layered_dependent_samples: Generates layered dependent samples for a given depth.

Sample_Consumer: Consumes samples by training the GNN over GPUs.

Gradient_Accumulation: Accumulates gradients over the world size (number of GPUs).
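The Gradient_Accumulation step can be sketched as averaging per-GPU gradients element-wise, as in synchronous data-parallel training. This is an illustration with plain lists in place of GPU tensors, and the gradient values are made up:

```python
def accumulate_gradients(per_gpu_grads):
    """Average the gradients computed by each of the world_size GPUs,
    element-wise, so every replica applies the same model update."""
    world_size = len(per_gpu_grads)
    n_params = len(per_gpu_grads[0])
    return [sum(g[i] for g in per_gpu_grads) / world_size
            for i in range(n_params)]

# Gradients from 2 GPUs for a 3-parameter model (illustrative numbers).
grads = accumulate_gradients([[1.0, 2.0, 3.0],
                              [3.0, 4.0, 5.0]])
```

In practice this averaging is done with a collective all-reduce across the GPUs rather than by gathering lists on one process.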

How to Use

GPU training pipeline:

1. Provide a graph name, number of epochs, number of GPUs, and a batch size

2. Run using the command "python GPU_Accelerator.py"

Layered Dependent Samples:

1. Provide a graph name, fanout and number of layers of GNN

2. Run using the command "python layered_depdent_sampler.py"
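The sampling scheme behind these steps can be sketched as follows: each layer's candidate set is the neighborhood of the nodes chosen at the previous layer, so the layers are sampled dependently rather than independently. This is an illustration only; the graph, seeds, and fanouts below are assumptions, not the project's defaults:

```python
import random

def layered_dependent_sample(adj, seeds, fanouts, rng=random):
    """Sample one layer at a time; layer k's candidates are the neighbors
    of the nodes sampled at layer k-1 (dependent across layers)."""
    layers = [list(seeds)]
    frontier = set(seeds)
    for fanout in fanouts:
        candidates = set()
        for u in frontier:                 # neighbors of the previous layer
            candidates.update(adj.get(u, []))
        k = min(fanout, len(candidates))   # cap by the fanout budget
        frontier = set(rng.sample(sorted(candidates), k))
        layers.append(sorted(frontier))
    return layers

# Toy directed graph; one fanout per GNN layer.
adj = {0: [1, 2], 1: [3], 2: [3, 4], 3: [], 4: []}
layers = layered_dependent_sample(adj, seeds=[0], fanouts=[2, 2])
```

A real importance sampler would weight candidates (e.g., by degree) instead of sampling uniformly, and would run the sampling on the GPU.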

This project is supported by an IITP grant funded by the Korea government (MSIT) (No. 2021-0-00859, Graph DBMS for intelligent processing of big graphs).