Dagster
About
This document describes the steps and requirements for building and deploying the Docker-based workflow implemented in Dagster.
Overview
The following image provides a broad overview of the elements that are loaded into the Docker orchestration environment. This is a very basic view and does not show any scaling or failover elements.
The key elements are:
- The sources and configuration, and the creation of the archive files that are loaded into the Gleaner and Nabu tools
- The Dagster set, which loads three containers to support workflow operations
- The Gleaner Architecture images, which load three or more containers to support:
  - s3 object storage
  - graph database (triplestore)
  - headless Chrome for page rendering, to support dynamically inserted JSON-LD
  - any other support packages, such as text, semantic or spatial indexes
- The GleanerIO tools, which load two containers (Gleaner and Nabu) that are run and removed by the Dagster workflow
Template files
The template files define the Dagster Ops, Jobs and Schedules. From these and a GleanerIO config file, a set of Python scripts for Dagster is created in the output directory.
These only need to be changed, or used to regenerate, if you wish to alter the execution graph (i.e., the ops, jobs and schedules) or change the config file. In the latter case only a regeneration needs to be done.
There are then Docker build scripts to build out new containers.
See: template
Steps to build and deploy
1) Move to the implnets directory
2) Place your gleanerconfig.yaml (use that exact name) in configs/NETWORK/gleanerconfig.yaml
3) Note: when doing your Docker build, you will use this NETWORK name as a value in the command, such as:
podman build --tag="docker.io/fils/dagster_nsdf:$(VERSION)" --build-arg implnet=nsdf --file=./build/Dockerfile
4) Make any needed edits to the templates in the directory templates/v1/, or make your own template set in that directory.
The command to build using the pygen.py program follows. This is run from the implnets directory.
python pygen.py -cf ./configs/nsdf/gleanerconfig.yaml -od ./generatedCode/implnet-nsdf/output -td ./templates/v1 -d 7
5) This will generate the code to build a Dagster instance from the combination of the templates and gleanerconfig.yaml.
Archive files
The archive files need to be compressed tar files containing the Gleaner or Nabu configs and other required files, such as the schema context.
The archive files are defined in:
# Gleaner archive
ARCHIVE_FILE = os.environ.get('GLEANERIO_GLEANER_ARCHIVE_OBJECT')
ARCHIVE_PATH = os.environ.get('GLEANERIO_GLEANER_ARCHIVE_PATH')
# Nabu archive
ARCHIVE_FILE = os.environ.get('GLEANERIO_NABU_ARCHIVE_OBJECT')
ARCHIVE_PATH = os.environ.get('GLEANERIO_NABU_ARCHIVE_PATH')
The required config file names are expressed in the CMD lines:
CMD = ["--cfg", "/gleaner/gleanerconfig.yaml", "--source", source]  # Gleaner
CMD = ["--cfg", "/nabu/nabuconfig.yaml", "prefix", "summoned/" + source]  # Nabu
So: gleanerconfig.yaml and nabuconfig.yaml are the required configuration names.
NOTE: At present only YAML is supported; JSON support would be a simple addition once the system is tested and working well with the YAML files.
The contents will look something like:
❯ tar -ztf GleanerCfg.tgz
./gleanerconfig.yaml
./jsonldcontext.json
❯ tar -ztf NabuCfg.tgz
./nabuconfig.yaml
./jsonldcontext.json
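To create these archives, something along these lines should work (a sketch, run from the directory holding the config files):
# create the compressed tar archives for Gleaner and Nabu
tar -czf GleanerCfg.tgz ./gleanerconfig.yaml ./jsonldcontext.json
tar -czf NabuCfg.tgz ./nabuconfig.yaml ./jsonldcontext.json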
These files need to be in the bucket prefix defined by:
GLEANERIO_GLEANER_ARCHIVE_OBJECT=scheduler/configs/GleanerCfg.tgz
GLEANERIO_NABU_ARCHIVE_OBJECT=scheduler/configs/NabuCfg.tgz
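One way to place the archives at those locations is with the MinIO client (mc); the alias name gleaner here is an assumption, and the endpoint values match the environment file in the next section:
# register the MinIO endpoint with mc
mc alias set gleaner http://192.168.202.114:49153 KEY SECRET
# copy the archives to the expected bucket prefixes
mc cp GleanerCfg.tgz gleaner/gleaner.test/scheduler/configs/GleanerCfg.tgz
mc cp NabuCfg.tgz gleaner/gleaner.test/scheduler/configs/NabuCfg.tgz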
Environment files
export PORTAINER_URL=http://example.org:9000/api/endpoints/2/docker/
export PORTAINER_KEY=KEY
export GLEANERIO_GLEANER_IMAGE=fils/gleaner:v3.0.11-development-df
export GLEANERIO_GLEANER_ARCHIVE_OBJECT=scheduler/configs/testGleanerCfg.tgz
export GLEANERIO_GLEANER_ARCHIVE_PATH=/gleaner/
export GLEANERIO_NABU_IMAGE=fils/nabu:2.0.4-developement
export GLEANERIO_NABU_ARCHIVE_OBJECT=scheduler/configs/testNabuCfg.tgz
export GLEANERIO_NABU_ARCHIVE_PATH=/nabu/
export GLEANERIO_LOG_PREFIX=scheduler/logs/
export GLEANER_MINIO_URL=192.168.202.114
export GLEANER_MINIO_PORT=49153
export GLEANER_MINIO_SSL=False
export GLEANER_MINIO_SECRET=SECRET
export GLEANER_MINIO_KEY=KEY
export GLEANER_MINIO_BUCKET=gleaner.test
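These exports can be kept in a file (the name dagster.env here is an assumption) and loaded into the shell before starting the services:
# load the exports into the current shell
source dagster.env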
Implementation Networks
This (https://github.com/sharmasagar25/dagster-docker-example) is an example of how to structure a Dagster project in order to organize the jobs, repositories, schedules, and ops. It also includes examples of unit tests and a docker-compose deployment file that uses a PostgreSQL database for the run, event_log and schedule storage.
This example should in no way be considered suitable for production; it is merely my own example of a possible file structure. I personally felt it was difficult to put the Dagster concepts to use, since the project's own examples had widely different structures and were hard to get an overview of as a beginner.
The example is based on the official Dagster tutorial.
Folders
- build: build directives for the Docker containers
- configs: network configuration files (gleanerconfig.yaml for each network)
- src: templates and the generated Dagster code
- tooling
Requirements
At this point it is expected that you have a valid Gleaner config file named gleanerconfig.yaml located in some path within the configs directory.
Building the dagster code from templates
The Python program pygen will read a Gleaner configuration file and a set of templates, and build the Dagster code from them.
python pygen.py -cf ./configs/nsdf/gleanerconfig.yaml -od ./src/implnet-nsdf/output -td ./src/implnet-nsdf/templates -d 7
Running
There is an example of how to run a single pipeline in src/main.py. First, install the dependencies in an isolated Python environment.
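For example, using a standard venv (the directory name .venv is an assumption):
# create and activate an isolated environment
python3 -m venv .venv
source .venv/bin/activate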
pip install -r requirements.txt
The code built above can be run locally, though your templates may be set up to reference services and other resources not present on your development machine; for complex examples like these, running locally can be problematic.
If you are looking for some simple examples of Dagster, check out the examples directory for some smaller self-contained workflows. These are good for testing things like sensors and other approaches.
If you still wish to try the generated code, cd into the output directory you specified in the pygen command.
Then use:
dagit -h ghost.lan -w workspace.yaml
Building
podman build -t docker.io/fils/dagster:0.0.24 .
podman push docker.io/fils/dagster:0.0.24
Appendix
Setup
Docker API sequence
Portainer API setup
You will need to set up Portainer to allow for API calls. To do this, see the Portainer documentation on accessing the Portainer API.
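As a quick sanity check that the URL and key work, a call like the following (using the variables from the environment file above) should list the containers behind the Portainer Docker API proxy:
# list containers through the Portainer Docker API proxy
curl -s -H "X-API-Key: ${PORTAINER_KEY}" "${PORTAINER_URL}containers/json"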
Notes
Single file testing run
dagit -h ghost.lan -f test1.py
- Don't forget to set the DAGSTER_HOME directory, for example:
export DAGSTER_HOME=/home/fils/src/Projects/gleaner.io/scheduler/python/dagster
dagster-daemon run
Run from the directory where workspace.yaml is located.
dagit --host 192.168.202.159
Cron Notes
A useful on-line tool: https://crontab.cronhub.io/
- 0 3 * * * : at 3:00 AM each day
- 0 3,5 * * * : at 3:00 and 5:00 AM each day
- 0 3 * * 0 : at 3:00 AM on Sunday
- 0 3 5 * * : at 3:00 AM on day 5 of the month
- 0 3 5,19 * * : at 3:00 AM on days 5 and 19 of the month
- 0 3 1/4 * * : at 3:00 AM every 4 days
Indexing Approaches
The following approaches are used to partition the sources:
- Divide up the sources by sitemap and sitegraph
- Also divide by production and queue sources
The above will result in at most 4 initial sets.
We can then use the Docker approach:
./gleanerDocker.sh -cfg /gleaner/wd/rundir/oih_queue.yaml --source cioosatlantic
to run indexes on specific sources in these configuration files.
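A simple sketch for indexing several sources in turn; every source name here other than cioosatlantic is a placeholder:
# loop over sources; names other than cioosatlantic are hypothetical
for src in cioosatlantic sourceB sourceC; do
  ./gleanerDocker.sh -cfg /gleaner/wd/rundir/oih_queue.yaml --source "$src"
done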