myDIG is a tool to build pipelines that crawl the web, extract information, build a knowledge graph (KG) from the extractions, and provide an easy-to-use interface to query the KG. The project web page is DIG.
You can install myDIG on a laptop or server and use it to build a domain-specific search application for any corpus of web pages, CSV, JSON and a variety of other files.
The installation guide is below.
user guide
advanced user guide
Operating systems: Linux, MacOS or Windows
System requirements: minimum 8GB of memory
myDIG uses Docker to make installation easy:
Install Docker and Docker Compose.
Configure Docker to use at least 6GB of memory. DIG will not work with less than 4GB and is unstable with less than 6GB.
On Mac and Windows, you can set the Docker memory in the Preferences menu of the Docker application. Details are in the Docker documentation pages (Mac Docker or Windows Docker). On Linux, Docker runs containers directly on the host kernel, so a recent kernel version and enough memory on the host are required.
Clone this repository.
git clone https://github.com/usc-isi-i2/dig-etl-engine.git
myDIG stores your project files on your disk, so you need to tell it where to put the files. You provide this information in the .env file in the folder where you installed myDIG. Create the .env file by copying the example environment file available in your installation.
cp ./dig-etl-engine/.env.example ./dig-etl-engine/.env
After you create your .env file, open it in a text editor and customize it. Here is a typical .env file:
COMPOSE_PROJECT_NAME=dig
DIG_PROJECTS_DIR_PATH=/Users/pszekely/Documents/mydig-projects
DOMAIN=localhost
PORT=12497
NUM_ETK_PROCESSES=2
KAFKA_NUM_PARTITIONS=2
DIG_AUTH_USER=admin
DIG_AUTH_PASSWORD=123
COMPOSE_PROJECT_NAME: leave this one alone if you only have one myDIG instance. This is the prefix used to differentiate docker-compose instances.
DIG_PROJECTS_DIR_PATH: this is the folder where myDIG will store your project files. Make sure the directory exists. The default setting will store your files in ./mydig-projects, so do mkdir ./mydig-projects if you want to use the default folder.
DOMAIN: change this if you install on a server that will be accessed from other machines.
PORT: you can customize the port where myDIG runs.
NUM_ETK_PROCESSES: myDIG uses multi-processing to ingest files. Set this number according to the number of cores you have on the machine. We don't recommend setting it to more than 4 on a laptop.
KAFKA_NUM_PARTITIONS: the number of partitions per topic. Set it to the same value as NUM_ETK_PROCESSES. Changing it will not affect the partition count of existing Kafka topics unless you drop the Kafka container (you will lose all data in Kafka topics).
DIG_AUTH_USER, DIG_AUTH_PASSWORD: myDIG uses nginx to control access; set the login user name and password here.
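As a rough illustration of the settings above, the following shell sketch checks two of the constraints described: that DIG_PROJECTS_DIR_PATH points at an existing directory and that KAFKA_NUM_PARTITIONS matches NUM_ETK_PROCESSES. The check_env function is a hypothetical helper, not part of myDIG, and it resolves a relative path against the current directory, which is an assumption.

```shell
# check_env FILE: sanity-check a myDIG .env file.
# Hypothetical helper for illustration; it is not shipped with myDIG.
check_env() {
  env_file="$1"
  rc=0
  # Read only the variables we need, without executing the file.
  projects_dir=$(sed -n 's/^DIG_PROJECTS_DIR_PATH=//p' "$env_file" | tail -n 1)
  etk=$(sed -n 's/^NUM_ETK_PROCESSES=//p' "$env_file" | tail -n 1)
  kafka=$(sed -n 's/^KAFKA_NUM_PARTITIONS=//p' "$env_file" | tail -n 1)

  # Assumption: a relative path is checked against the current directory.
  if [ -z "$projects_dir" ] || [ ! -d "$projects_dir" ]; then
    echo "DIG_PROJECTS_DIR_PATH is missing or not a directory: '$projects_dir'"
    rc=1
  fi
  if [ "$etk" != "$kafka" ]; then
    echo "KAFKA_NUM_PARTITIONS ($kafka) differs from NUM_ETK_PROCESSES ($etk)"
    rc=1
  fi
  return "$rc"
}

# Usage: check_env ./dig-etl-engine/.env
```

The function only reads the file with sed, so a malformed .env cannot execute anything by accident.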
If you are working on Linux, do these additional steps:
chmod 666 logstash/sandbox/settings/logstash.yml
sysctl -w vm.max_map_count=262144
# replace <DIG_PROJECTS_DIR_PATH> with your own project path
mkdir -p <DIG_PROJECTS_DIR_PATH>/.es/data
chown -R 1000:1000 <DIG_PROJECTS_DIR_PATH>/.es
To set vm.max_map_count permanently, please update it in /etc/sysctl.conf and reload the sysctl settings with sysctl -p /etc/sysctl.conf.
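One possible way to make that change repeatable is sketched below. The persist_sysctl function is a hypothetical helper (not part of myDIG) that removes any existing assignment of a key from a sysctl-style file and appends the new value; the optional file argument exists only so the sketch can be tried on a scratch file first.

```shell
# persist_sysctl KEY VALUE [FILE]: set KEY=VALUE in a sysctl-style file.
# Hypothetical helper; FILE defaults to /etc/sysctl.conf.
persist_sysctl() {
  key="$1"
  value="$2"
  file="${3:-/etc/sysctl.conf}"
  tmp=$(mktemp)
  # Drop any existing assignment of KEY. The dots in KEY are treated as
  # regex wildcards, which is loose but sufficient for this sketch.
  grep -v "^[[:space:]]*$key[[:space:]]*=" "$file" > "$tmp" 2>/dev/null || true
  echo "$key=$value" >> "$tmp"
  # Overwrite in place so the file keeps its owner and permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage (as root):
#   persist_sysctl vm.max_map_count 262144
#   sysctl -p /etc/sysctl.conf
```

Running it twice leaves a single vm.max_map_count line, so it is safe to include in a provisioning script.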
If Docker runs out of space, move the default Docker installation to a volume with more space:
sudo mv /var/lib/docker /path_with_more_space
sudo ln -s /path_with_more_space /var/lib/docker
To run myDIG do:
./engine.sh up
On some operating systems Docker commands require elevated privileges; if so, add sudo before them. You can also run ./engine.sh up -d to run myDIG as a daemon process in the background. Wait a couple of minutes to ensure all the services are up.
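Instead of waiting a fixed time, a script can poll until the UI answers. The sketch below defines a generic retry helper (hypothetical, not part of myDIG). The curl invocation in the usage comment assumes the default port from the example .env and assumes that nginx uses HTTP basic authentication with DIG_AUTH_USER and DIG_AUTH_PASSWORD.

```shell
# retry TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds after each failed attempt.
retry() {
  tries="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$tries" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (assumes basic auth and the example credentials admin/123):
#   retry 30 10 curl -f -s -o /dev/null -u admin:123 \
#     http://localhost:12497/mydig/ui/
```

With 30 tries and a 10 second delay this waits up to five minutes, which leaves some margin over the "couple of minutes" mentioned above.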
To stop myDIG do:
./engine.sh stop
(Use ./engine.sh down to drop all containers.)
Once myDIG is running, go to your browser and visit http://localhost:12497/mydig/ui/
Note: myDIG currently works only on Chrome
To use myDIG, look at the user guide
myDIG v2 is now in alpha; there are a couple of big, incompatible changes.
Data, configs and logs of components are now in DIG_PROJECTS_DIR_PATH/.*.
Kafka queue data will NOT be cleaned up even after doing ./engine.sh down; you need to delete DIG_PROJECTS_DIR_PATH/.kafka and then restart the engine (if you change NUM_ETK_PROCESSES).
There's no default resource any more; if a resource file (glossary) is not compatible, please delete it.
There's no custom_etk_config.json or additional_etk_config/* any more; instead, generated ETK modules are in working_dir/generated_em and additional modules are in working_dir/additional_ems.
ETK log is not fully implemented and tested. Runtime logs will APPEND to working_dir/etk_worker_*.log.
Spacy rule editor is not working.
ELK (Elastic Search, LogStash & Kibana) components have been upgraded to 5.6.4, and other services in myDIG were also updated. What you need to do is:
Do docker-compose down
Delete the directory DIG_PROJECTS_DIR_PATH/.es.
You will lose all data and indices in the previous Elastic Search and Kibana.
On 20 Oct 2017 there were incompatible changes in the Landmark tool (1.1.0); the rules you defined will be deleted when you upgrade to the new system. Please follow these instructions:
Delete DIG_PROJECTS_DIR_PATH/.landmark
Delete files in DIG_PROJECTS_DIR_PATH/<project_name>/landmark_rules/*
There are also incompatible changes in the myDIG web service (1.0.11). Instead of crashing, it will show N/As in the TLD table; you need to update the desired number.
MyDIG web service GUI: http://localhost:12497/mydig/ui/
Elastic Search: http://localhost:12497/es/
Kibana: http://localhost:12497/kibana/
Kafka Manager (optional): http://localhost:12497/kafka_manager/
# run with ache
./engine.sh +ache up
# run with ache and rss crawler in background
./engine.sh +ache +rss up -d
# stop containers
./engine.sh stop
# drop containers
./engine.sh down
In the .env file, add comma-separated add-on names:
DIG_ADD_ONS=ache,rss
Then, simply do ./engine.sh up. You can also invoke additional add-ons at run time: ./engine.sh +dev up.
ache: ACHE Crawler (coming soon).
rss: RSS Feed Crawler (coming soon).
kafka-manager: Kafka Manager.
dev: Development mode.
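To make the relationship between the two ways of enabling add-ons concrete, the sketch below converts a DIG_ADD_ONS value into the corresponding +name arguments. It assumes that DIG_ADD_ONS=ache,rss in .env has the same effect as passing +ache +rss on the command line, as the text suggests; addons_to_flags is a hypothetical helper and says nothing about how engine.sh parses the variable internally.

```shell
# addons_to_flags "ache,rss": print the equivalent "+ache +rss" arguments.
# Hypothetical helper; names must not contain spaces around the commas.
addons_to_flags() {
  flags=""
  old_ifs=$IFS
  IFS=','
  for name in $1; do
    if [ -n "$name" ]; then
      flags="$flags +$name"
    fi
  done
  IFS=$old_ifs
  # Strip the single leading space left by the loop.
  echo "${flags# }"
}

# Usage:
#   addons_to_flags "ache,rss"    # prints: +ache +rss
```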
COMPOSE_PROJECT_NAME=dig
DIG_PROJECTS_DIR_PATH=./../mydig-projects
DOMAIN=localhost
PORT=12497
NUM_ETK_PROCESSES=2
KAFKA_NUM_PARTITIONS=2
DIG_AUTH_USER=admin
DIG_AUTH_PASSWORD=123
DIG_ADD_ONS=ache
KAFKA_HEAP_SIZE=512m
ZK_HEAP_SIZE=512m
LS_HEAP_SIZE=512m
ES_HEAP_SIZE=1g
DIG_NET_SUBNET=172.30.0.0/16
DIG_NET_KAFKA_IP=172.30.0.200
# only works in development mode
MYDIG_DIR_PATH=./../mydig-webservice
ETK_DIR_PATH=./../etk
SPACY_DIR_PATH=./../spacy-ui
RSS_DIR_PATH=./../dig-rss-feed-crawler
If some of the Docker images in the docker-compose file (those tagged latest) have been updated, run docker-compose pull <service name> first.
The data in the Kafka queue will be cleaned after two days. If you want to delete the data immediately, drop the Kafka container.
If you want to run your own ETK config, name the file custom_etk_config.json and put it in DIG_PROJECTS_DIR_PATH/<project_name>/working_dir/.
If you have additional ETK config files, please put them into DIG_PROJECTS_DIR_PATH/<project_name>/working_dir/additional_etk_config/ (create the directory additional_etk_config if it's not there).
If you are using a custom ETK config or additional ETK configs, you need to take care of all file paths in these config files. DIG_PROJECTS_DIR_PATH/<project_name> will be mapped to /shared_data/projects/<project_name> in Docker, so make sure all the paths you use in the configs start with this prefix.
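The path mapping described above can be sketched as a small shell function that rewrites a host path under DIG_PROJECTS_DIR_PATH into the path seen inside the container. The to_container_path name is hypothetical; only the /shared_data/projects prefix comes from the text.

```shell
# to_container_path PROJECTS_DIR HOST_PATH: print the in-container path for a
# file that lives under PROJECTS_DIR on the host. Hypothetical helper.
to_container_path() {
  projects_dir="${1%/}"   # tolerate a trailing slash
  host_path="$2"
  case "$host_path" in
    "$projects_dir"/*)
      rel=${host_path#"$projects_dir"/}
      echo "/shared_data/projects/$rel"
      ;;
    *)
      echo "path is outside DIG_PROJECTS_DIR_PATH: $host_path" >&2
      return 1
      ;;
  esac
}

# Usage:
#   to_container_path /Users/me/mydig-projects \
#     /Users/me/mydig-projects/new_project/working_dir/custom_etk_config.json
```

The printed value is the form a path should take when it is written into a custom ETK config.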
If you want to clean up all ElasticSearch data, remove the .es directory in your DIG_PROJECTS_DIR_PATH.
If you want to clean up all of the Landmark Tool's database data, remove the .landmark directory in your DIG_PROJECTS_DIR_PATH. But this will make published rules untraceable.
On Linux, if you cannot access the Docker network from the host machine:
1. Stop the Docker containers.
2. Run docker network ls to find the id of dig_net, find this id in the output of ifconfig, run ifconfig <interface id> down to take this network interface down, and restart the Docker service.
On Linux, if DNS does not work correctly in dig_net, please refer to this post.
On Linux, solutions for potential Elastic Search problem can be found here.
If there's a Docker network conflict, use docker network rm <network id> to remove the conflicting network.
POST /create_project
{
  "project_name": "new_project"
}
POST /run_etk
{
  "project_name": "new_project",
  "number_of_workers": 4,
  "input_offset": "seek_to_end", // optional
  "output_offset": "seek_to_end" // optional
}
POST /kill_etk
{
  "project_name": "new_project",
  "input_offset": "seek_to_end", // optional
  "output_offset": "seek_to_end" // optional
}
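A minimal sketch of calling these endpoints with curl follows. The base URL is an assumption (port 9999 is taken from the port list for the DIG ETL Engine, and whether it is reachable from the host depends on your setup), as is the Content-Type header. The helper names run_etk_payload and post_json are hypothetical. The optional offset fields are omitted, and the // optional annotations must not be sent since they are not valid JSON.

```shell
# Assumption: the ETL engine API is reachable at this address.
DIG_ETL_URL="${DIG_ETL_URL:-http://localhost:9999}"

# run_etk_payload PROJECT WORKERS: print the JSON body for POST /run_etk.
# PROJECT must not contain double quotes or backslashes.
run_etk_payload() {
  printf '{"project_name": "%s", "number_of_workers": %d}\n' "$1" "$2"
}

# post_json ENDPOINT BODY: send a JSON body to the engine with curl.
post_json() {
  curl -sS -X POST -H 'Content-Type: application/json' -d "$2" "$DIG_ETL_URL$1"
}

# Usage (requires a running engine):
#   post_json /create_project '{"project_name": "new_project"}'
#   post_json /run_etk "$(run_etk_payload new_project 4)"
#   post_json /kill_etk '{"project_name": "new_project"}'
```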
Create the .env file from .env.example and change the environment variables.
Run ./engine.sh up for the sandbox version.
Run docker-compose -f docker-compose-production.yml up for the production version.
DIG ETL Engine: 9999
Kafka: 9092
Zookeeper: 2181
ElasticSearch: 9200, 9300
Sandpaper: 9876
DIG UI: 8080
myDIG: 9879 (ws), 9880 (gui), 9881 (spacy ui), 12121 (daemon, bind to localhost)
Landmark Tool: 3333, 5000, 3306
Logstash: 5959 (udp, used by etk log)
Kibana: 5601
Nginx: 80
dig_net is the LAN in Docker Compose.
build Nginx image:
docker build -t uscisii2/nginx:auth-1.0 nginx/.
build ETL image:
# git commit all changes first, then
./release_docker.sh tag
git push --tags
# update DIG_ETL_ENGINE_VERSION in file VERSION
./release_docker.sh build
./release_docker.sh push
Invoke development mode:
# clone a new etl to avoid conflict
git clone https://github.com/usc-isi-i2/dig-etl-engine.git dig-etl-engine-dev
cd dig-etl-engine-dev
# switch to dev branch or other feature branches
git checkout dev
# create .env from .env.example
# change `COMPOSE_PROJECT_NAME` in .env from `dig` to `digdev`
# you also need a new project folder
# run docker in dev branch
./engine.sh up
# run docker in dev mode (optional)
./engine.sh +dev up
auto_offset_reset
Value type is string
There is no default value for this setting.
What to do when there is no initial offset in Kafka or if an offset is out of range:
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer’s group
anything else: throw exception to the consumer.
bootstrap_servers
Value type is string
Default value is "localhost:9092"
A list of URLs to use for establishing the initial connection to the cluster. This list should be in the form of host1:port1,host2:port2. These URLs are just used for the initial connection to discover the full cluster membership (which may change dynamically), so this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
consumer_threads
Value type is number
Default value is 1
Ideally you should have as many threads as the number of partitions for a perfect balance; more threads than partitions means that some threads will be idle.
group_id
Value type is string
Default value is "logstash"
The identifier of the group this consumer belongs to. A consumer group is a single logical subscriber that happens to be made up of multiple processors. Messages in a topic will be distributed to all Logstash instances with the same group_id.
topics
Value type is array
Default value is ["logstash"]
A list of topics to subscribe to, defaults to ["logstash"].
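To show how the settings above fit together, the sketch below writes a Logstash pipeline fragment with a kafka input that uses all five options. The values are illustrative assumptions: the defaults quoted above, plus consumer_threads set to 2 to match KAFKA_NUM_PARTITIONS=2 in the example .env. The write_kafka_input helper and the output file name are hypothetical and do not reproduce the pipeline that myDIG ships.

```shell
# write_kafka_input FILE: write an example Logstash kafka input to FILE.
# Hypothetical helper; the option values are illustrative, not myDIG's own.
write_kafka_input() {
  cat > "$1" <<'EOF'
input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["logstash"]
    group_id => "logstash"
    consumer_threads => 2
    auto_offset_reset => "earliest"
  }
}
EOF
}

# Usage:
#   write_kafka_input ./kafka-input.conf
```

Setting consumer_threads equal to the partition count follows the balance advice given for that option.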