Graph generation
Phase 1: Python Dependencies
The very first cell installs two Python libraries that are not available by default in Colab:
- pyvis: generates the interactive HTML graph renderings used by every visualization function in viz_package. Every clickable node and every coloured edge goes through pyvis.
- ipywidgets: powers the interactive controller panel in the analysis phases (dropdown menus, buttons, sliders).
All other dependencies (networkx, neo4j, numpy, etc.) will be installed automatically later, via the requirements.txt file from the repository cloned in Phase 3.
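A quick way to confirm that both libraries are importable before moving on is sketched below. The helper name `missing_packages` is hypothetical and not part of the project; it relies only on the standard-library `importlib.util.find_spec`.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Hypothetical check for the two Phase 1 libraries; an empty list means both are installed.
print(missing_packages(["pyvis", "ipywidgets"]))
```

If the printed list is not empty, re-running the installation cell is the simplest fix.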
Phase 2: Local Neo4j Instance with APOC
Why Neo4j?
An Active Directory is fundamentally a graph: users, computers, groups and policies are connected by permission, membership or trust relationships. Neo4j is a database designed specifically to store and query this kind of structure. ADSimulator writes its generated AD into Neo4j using the Cypher query language, and it is from Neo4j that attacks will later be simulated.
What this cell does
This cell sets up a Neo4j server entirely locally, without administrator rights and without any system-level installation:
- Downloads Neo4j Community 5.18.0 from the official website and extracts it into a neo4j_local/ folder.
- Installs the APOC plugin (Awesome Procedures On Cypher), an extension that ADSimulator requires for advanced operations such as JSON graph export (apoc.export.json).
- Configures two files (neo4j.conf and apoc.conf) to allow APOC to run and to write files to disk.
- Sets the initial password to password, the value expected by default by ADSimulator.
Important: Neo4j is not started here. ADSimulator will handle that automatically when it needs it in Phase 7. This cell only prepares the installation.
♻️ The rm -rf neo4j_local line at the top guarantees a clean install if you re-run the notebook from scratch.
%%bash
NEO4J_VERSION="5.18.0"
# Always start from a clean install, so the directory check below always passes.
rm -rf neo4j_local
if [ ! -d "neo4j_local" ]; then
    # Quote the URL so the shell does not treat "?" as a glob character.
    wget -q -nc "https://neo4j.com/artifact.php?name=neo4j-community-$NEO4J_VERSION-unix.tar.gz" -O neo4j.tar.gz
    tar -xzf neo4j.tar.gz
    mv "neo4j-community-$NEO4J_VERSION" neo4j_local
    rm neo4j.tar.gz
fi
# Download the APOC core jar matching the Neo4j version.
if [ ! -f "neo4j_local/plugins/apoc-$NEO4J_VERSION-core.jar" ]; then
    wget -q -nc "https://github.com/neo4j/apoc/releases/download/$NEO4J_VERSION/apoc-$NEO4J_VERSION-core.jar" \
        -P neo4j_local/plugins/
fi
CONF_FILE="neo4j_local/conf/neo4j.conf"
APOC_CONF="neo4j_local/conf/apoc.conf"
# Configure APOC and set the initial password only once.
if ! grep -q "dbms.security.procedures.unrestricted=apoc.\*" "$CONF_FILE"; then
    echo "dbms.security.procedures.unrestricted=apoc.*" >> "$CONF_FILE"
    echo "apoc.export.file.enabled=true" > "$APOC_CONF"
    ./neo4j_local/bin/neo4j-admin dbms set-initial-password "password"
fi
chmod -R 755 neo4j_local
echo "[+] Neo4j environment ready!"
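The grep-then-append pattern used for neo4j.conf can also be written in Python. The sketch below is only an illustration of that idempotent update; `ensure_line` is a hypothetical helper, not something the project or Neo4j provides.

```python
from pathlib import Path


def ensure_line(conf_path, line):
    """Append `line` to the file unless an identical line is already present.

    Returns True if the file was modified, False if the line was already there.
    """
    path = Path(conf_path)
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False
    # Start on a fresh line if the existing content lacks a trailing newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as handle:
        handle.write(prefix + line + "\n")
    return True
```

Calling `ensure_line("neo4j_local/conf/neo4j.conf", "dbms.security.procedures.unrestricted=apoc.*")` twice would leave a single copy of the setting in the file, which is the property the shell cell relies on when it is re-run.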
Phase 3: Cloning the Project Repository
This cell downloads the cyber_project_adas repository, which contains all of the project source code.
Repository structure
Why --recursive and git lfs pull?
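The adsimulator/ folder is not ordinary code: it is a Git submodule, meaning a link to an independent external repository (nicolas-carolo/adsimulator). Unless submodules are fetched, either with git clone --recursive or, as in the cell below, with git submodule update --init --recursive after a plain clone, this folder would remain empty after cloning.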
In addition, some of ADSimulator's data files (lists of department names, operating systems, etc.) are stored using Git LFS (Large File Storage), a Git extension for large files. The git lfs pull commands are therefore required to fetch the actual files rather than mere pointers.
Finally, pip install -r requirements.txt installs all remaining Python dependencies of the project in a single command.
%%bash
rm -rf cyber_project_adas /root/.adsimulator
git clone https://github.com/Maelh1/cyber_project_adas
cd cyber_project_adas
git submodule update --init --recursive
git lfs pull
git submodule foreach git lfs pull
pip install -r requirements.txt
cd ..
echo "[+] Repository ready."
ls -la cyber_project_adas
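To check that git lfs pull really replaced the pointers with data, one can scan the clone for files that still begin with the Git LFS pointer header. The sketch below is a hedged illustration; the function names are hypothetical and the scan covers every file under the given root.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def unresolved_pointers(root):
    """List files under `root` that still look like LFS pointers."""
    return sorted(
        str(p) for p in Path(root).rglob("*") if p.is_file() and is_lfs_pointer(p)
    )
```

An empty result from `unresolved_pointers("cyber_project_adas")` would suggest that all LFS-tracked files were downloaded.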
Phase 4: Installing ADSimulator
ADSimulator is the generation engine of the Active Directory graph. It is distributed as a standard Python package, embedded inside the repository cloned in the previous phase. Its installation requires two adjustments specific to the Colab environment:
1. Root restriction bypass
ADSimulator refuses by default to run as root, as a security measure, yet Colab kernels run with root privileges. The fix is straightforward: create the enable_root.cfg configuration file that the program checks for in order to lift this restriction.
2. Data file copy
ADSimulator needs reference files (name lists, operating system lists, department names...) at a specific location on disk: /root/.adsimulator/data/. These files are present in the cloned repository and must be copied there manually.
The pip uninstall then python setup.py install sequence ensures that the version from the current repository is used, rather than one cached from a previous run.
%%bash
cd cyber_project_adas/adsimulator_graph_generator/adsimulator
echo "[*] Triggering the developer's root bypass..."
mkdir -p /root/.adsimulator
touch /root/.adsimulator/enable_root.cfg
echo "[*] Copying required data files..."
cp -r data /root/.adsimulator/
pip uninstall -y adsimulator
python setup.py install
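For readers who prefer to stay in Python, the two filesystem adjustments above (the flag file and the data copy) can be sketched as follows. `prepare_adsimulator_home` is a hypothetical helper written for illustration; only the enable_root.cfg file name and the /root/.adsimulator/data/ location come from the notebook.

```python
import shutil
from pathlib import Path


def prepare_adsimulator_home(data_src, home="/root/.adsimulator"):
    """Create the root-bypass flag file and copy the reference data next to it."""
    home = Path(home)
    home.mkdir(parents=True, exist_ok=True)
    # An empty file is enough: the shell cell creates it with `touch`.
    (home / "enable_root.cfg").touch()
    # Mirror `cp -r data /root/.adsimulator/`, merging into an existing folder.
    shutil.copytree(data_src, home / "data", dirs_exist_ok=True)
    return home
```

The `home` parameter exists only so the helper can be tried against a scratch directory before touching /root.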
Phase 5: Locating the ADSimulator Executable
Once installed, ADSimulator is accessible as a system command (an adsimulator binary somewhere in the PATH). This short Python cell locates that executable dynamically using shutil.which, exactly as the which adsimulator command would in a terminal.
A fallback mechanism (/usr/local/bin/adsimulator) is provided for the case where Jupyter's path cache has not yet registered the installation performed in the previous phase. The found path is stored in the adsim_exe variable, which adsim_utils will use in Phase 7 to launch the generation.
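import shutil
import subprocess
# Look up the executable on PATH, as "which adsimulator" would.
adsim_exe = shutil.which("adsimulator")
if adsim_exe is None:
    # Fall back to the usual install location if the PATH lookup fails.
    adsim_exe = "/usr/local/bin/adsimulator"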
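The cell above assumes the fallback path exists. A slightly stricter variant, sketched here with a hypothetical `resolve_executable` helper, fails early with a clear message when neither location holds an executable file.

```python
import os
import shutil


def resolve_executable(name, fallback):
    """Return the path to `name` from PATH, else `fallback` if it is executable."""
    found = shutil.which(name)
    if found is not None:
        return found
    if os.path.isfile(fallback) and os.access(fallback, os.X_OK):
        return fallback
    raise FileNotFoundError(f"{name!r} not found on PATH or at {fallback}")
```

With this helper, `adsim_exe = resolve_executable("adsimulator", "/usr/local/bin/adsimulator")` would raise in Phase 5 rather than letting a missing binary surface later during generation.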
Phase 6: Importing the Project Modules
The project is now installed and located. This cell loads all the necessary Python modules into memory for the rest of the pipeline.
Three families of modules are imported:
- Project-specific pipeline modules: adsim_utils (AD generation), attacks (attack simulation), viz_tools and process_graph (graph filtering and processing). All located in cyber_project_adas/adsimulator_graph_generator/src/.
- Standard scientific stack: networkx (graph manipulation), numpy (numerical computation), matplotlib (static visualization), neo4j (database connection), etc.
- viz_package: the project's visualization and control module, exposing 24 public functions covering graph analysis, browsing, and defensive simulation.
The importlib.reload(adsim_utils) line forces the module to be reloaded from disk, preventing any conflict with a version cached from a previous run.
import sys, os
# Make the cloned repository importable as the "src" package.
repo_path = os.path.abspath("./cyber_project_adas/adsimulator_graph_generator")
if repo_path not in sys.path:
    sys.path.insert(0, repo_path)
# Project pipeline modules.
from src import adsim_utils, attacks, viz_tools
import importlib
# Reload from disk in case a stale copy is cached from a previous run.
importlib.reload(adsim_utils)
# Standard library and scientific stack.
import json, random, subprocess, time
import networkx as nx
import numpy as np
from neo4j import GraphDatabase
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
# Visualization and control module.
from src import viz_package as vp
print(f"[+] viz_package loaded from: {vp.__file__}")
print(f"[+] viz_package: {len([n for n in dir(vp) if not n.startswith('_') and callable(getattr(vp, n))])} public symbols")
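The last line of the cell counts public callables with a dense one-liner. The same logic, unpacked into a small helper for readability, might look like the sketch below; `public_callables` is a hypothetical name, not a function of viz_package.

```python
def public_callables(module):
    """Return the sorted names of callables in `module` that do not start with '_'."""
    return sorted(
        name
        for name in dir(module)
        if not name.startswith("_") and callable(getattr(module, name))
    )
```

Because `dir()` also lists names that a module imports from elsewhere, the count can exceed the number of functions the module itself defines.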
Phase 7: Generating the Synthetic Active Directory Graph
This is the central phase of the entire setup: it produces the AD graph that will serve as the training ground for all attack simulations.
What adsim_utils.run_pipeline() does
A single function call orchestrates four operations in sequence:
- Starting Neo4j: the server installed in Phase 2 is launched.
- Clearing the database: any existing graph is wiped to start from a clean state.
- Generating the AD environment: ADSimulator populates the Neo4j database with nodes (users, computers, groups, OUs, GPOs...) and relationships (permissions, memberships, delegations...) according to the parameters defined in config.
- Exporting to JSONL: the complete graph is exported to Dataset/graph_0.json, the format consumed by all subsequent analysis modules.
The argument 0 is the run index: if you re-run the cell with the same index, the export file is simply overwritten.
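Once the export exists, a quick sanity check is to count nodes and relationships by type. The sketch below assumes each line is a JSON object in the shape produced by APOC's JSON export: a type field equal to node or relationship, a labels list on nodes, and a label string on relationships. That shape is an assumption here and should be checked against an actual export; `summarize_export` is a hypothetical helper.

```python
import json
from collections import Counter


def summarize_export(lines):
    """Count node labels and relationship types in an iterable of JSONL lines."""
    node_labels, rel_types = Counter(), Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        record = json.loads(raw)
        if record.get("type") == "node":
            node_labels.update(record.get("labels", []))
        elif record.get("type") == "relationship":
            rel_types[record.get("label")] += 1
    return node_labels, rel_types
```

A typical call would pass an open file handle, for example `with open("Dataset/graph_0.json") as f: nodes, rels = summarize_export(f)`, and compare the counts with the values set in config.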
Understanding the config dictionary
The config dictionary is the single place where you control the shape of your synthetic AD. It is organised by AD object family:
| Section | What you control |
|---|---|
| Domain | Domain functional level (2008 to 2016), inter-domain trust relationships |
| Computer | Number of machines, OS, exposed protocols (RDP, PSRemote, DCOM), Kerberos delegation |
| DC | Number and OS of domain controllers, LAPS deployment |
| User | Number of users, Kerberoastable accounts, AS-REP Roastable accounts, unconstrained delegation |
| OU | Number of organisational units |
| Group | Number of groups, nesting probability, department distribution |
| GPO | Number of group policy objects |
| ACLs | Percentage of principals with ACEs, permission types (GenericAll, WriteDacl, etc.) |
The notebook's default configuration generates a deliberately small and readable environment: 5 computers, 5 users, 5 OUs, 2 groups, 1 GPO, with no ACLs and no delegation. This is a good starting point for understanding the graph structure before scaling it up.
config = {
    "Domain": {
        "functionalLevelProbability": {"2016": 100, ...},
        "Trusts": {"SIDFilteringProbability": 100, ...}
    },
    "Computer": {"nComputers": 5, ...},
    "User": {"nUsers": 5, ...},
    "OU": {"nOUs": 5},
    "Group": {"nGroups": 2, ...},
    "GPO": {"nGPOs": 1},
    "ACLs": {"ACLPrincipalsPercentage": 0, ...}
}
adsim_utils.run_pipeline(0, custom_config=config)
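When scaling the environment up, it can be convenient to keep the small configuration as a base and layer a few overrides on top of it. The sketch below shows one way to do that with a recursive merge; `deep_merge` is a hypothetical helper, not part of adsim_utils, and the dictionaries are trimmed to a few keys from the notebook for illustration.

```python
import copy


def deep_merge(base, overrides):
    """Return a new dict where nested keys in `overrides` replace those in `base`."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Trimmed-down base configuration (only a few of the notebook's keys).
small = {"Computer": {"nComputers": 5}, "User": {"nUsers": 5}, "Group": {"nGroups": 2}}
# Scale up machines and users while leaving the group count untouched.
larger = deep_merge(small, {"Computer": {"nComputers": 50}, "User": {"nUsers": 200}})
```

The merged dictionary could then be passed as custom_config with a different run index, so that the small graph's export is not overwritten.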
End Result: an AD Graph Ready for Attack Simulation
At the end of these 7 phases, your Colab environment contains:
- A fully operational Neo4j instance with the AD graph loaded in the database
- A Dataset/graph_0.json file containing the complete graph in JSONL format (nodes + relationships)
- All project Python modules imported and ready to use
This graph is the foundation on which all subsequent steps are built: probability annotation, simulation of the 4 attack families, and interactive analysis via the viz_package controller.