I am looking into Log2Vec as an alternative to TF-IDF for log vectorization. Later on, I am primarily interested in consuming Sysmon logs.
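For context, here is a minimal pure-Python sketch of a TF-IDF baseline and of the out-of-vocabulary problem that Log2Vec targets. The log lines and the function names `fit_idf` and `tfidf_vector` are made up for illustration; this is not code from the Log2Vec repository.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute smoothed inverse document frequencies from tokenized docs."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def tfidf_vector(doc, idf, vocab):
    """Map one tokenized doc onto the fixed vocabulary; unseen tokens get no slot."""
    counts = Counter(doc)
    return [counts[tok] / len(doc) * idf[tok] for tok in vocab]


# Hypothetical, already tokenized log lines (variables removed).
train = [
    ["receiving", "block", "src", "dest"],
    ["deleting", "block", "file"],
    ["receiving", "block", "failed"],
]
idf = fit_idf(train)
vocab = sorted(idf)

# "interrupted" never occurred during fitting, so it has no dimension at all.
new_line = ["receiving", "block", "interrupted"]
vec = tfidf_vector(new_line, idf, vocab)
oov = [tok for tok in new_line if tok not in idf]
print("dimensions:", len(vec), "dropped tokens:", oov)
```

The new token is silently dropped from the vector, which is the gap an OOV word processor is meant to close.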
Log2Vec
Abstract: Logs are one of the most valuable data sources for large-scale service management. Log representation, which converts unstructured texts to structured vectors or matrices, serves as the first step towards automated log analysis. However, the current log representation methods neither represent domain-specific semantic information of logs, nor handle the out-of-vocabulary (OOV) words of new types of logs at runtime. We propose Log2Vec, a semantic-aware representation framework for log analysis. Log2Vec combines a log-specific word embedding method to accurately extract the semantic information of logs, with an OOV word processor to embed OOV words into vectors at runtime. We present an analysis on the impact of OOV words and evaluate the performance of the OOV word processor. The evaluation experiments on four public production log datasets demonstrate that Log2Vec not only fixes the issue presented by OOV words, but also significantly improves the performance of two popular log-based service management tasks, including log classification and anomaly detection. We have packaged Log2Vec into an open-source toolkit and hope that it can be used for future research.
https://github.com/NetManAIOps/Log2Vec
The work was supported by National Key R&D Program of China (Grant No. 2019YFB1802504, 2018YFB1800405), the National Natural Science Foundation of China (Grant Nos. 61772307, 61902200 and 61402257), the China Postdoctoral Science Foundation (2019M651015) and the Beijing National Research Center for Information Science and Technology (BNRist).
Paper
Our paper is published at the 29th International Conference on Computer Communications and Networks (ICCCN 2020). The information can be found here:
...
I use separate conda environments for older, research-grade software. Software moves at a rapid pace, and getting old code to build requires a little software engineering skill.
https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#
The following dependencies need to be present:
Code Block |
---|
1. nltk, nltk.download("wordnet")
2. spacy, spacy.load("en_core_web_md")
3. progressbar
4. dynet (python3) |
gensim 3.x
A C++ compiler tool chain (I document my ML env below)
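A quick way to see which of the Python dependencies are already importable in the active environment is sketched below. The helper name `missing_modules` is my own. It only checks that the packages can be found; it does not verify the WordNet data, the spaCy model, or the C++ tool chain.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the Python dependencies listed above.
REQUIRED = ["nltk", "spacy", "progressbar", "dynet", "gensim"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```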
dynet
...
The last release is from 2020 (as of this writing). distutils was deprecated in Python 3.10 and removed from the standard library in Python 3.12, which causes build errors.
...
I chose Python 3.9, where distutils is still available, so the error is avoided. Your requirements may be stricter.
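The version reasoning can be written down as a small check. This is a sketch that only encodes the distutils timeline (deprecated in 3.10, removed in 3.12); the function name `distutils_status` is my own, and whether a given dynet build actually fails also depends on the installed setuptools, which I have not tested across versions.

```python
import sys


def distutils_status(version_info=sys.version_info):
    """Classify the standard-library distutils situation for a Python version."""
    major_minor = tuple(version_info[:2])
    if major_minor >= (3, 12):
        return "removed"
    if major_minor >= (3, 10):
        return "deprecated"
    return "available"


if __name__ == "__main__":
    print(sys.version.split()[0], "->", distutils_status())
```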
Code Block |
---|
conda create --name log2vec python=3.9
conda activate log2vec
pip install 'setuptools<57.0.0'
pip install --verbose dynet --no-build-isolation |
...
build-essential and cmake (Linux Mint 21)
This is straightforward, but I document the compiler version here for the sake of completeness.
...
Code Block |
---|
conda install anaconda::spacy
conda install conda-forge::spacy-model-en_core_web_md |
Build the C++ project (make)
Code Block |
---|
(log2vec) marius@mleng:~/source/Log2Vec/code/LRWE/src$ make clean
rm -rf word2vec lrcwe
(log2vec) marius@mleng:~/source/Log2Vec/code/LRWE/src$ make -j 4
g++ word2vec.c -o word2vec -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result
g++ lrcwe.c -o lrcwe -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result |
gensim 3.x
gensim 4.x introduced breaking API changes. Pinning version 3.x avoids the resulting errors.
Code Block |
---|
conda install conda-forge::gensim=3.8.3 |
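One example of such a breaking change: gensim 3.x exposes the vocabulary of a `KeyedVectors` object as `vocab`, while 4.x replaced it with `key_to_index`. The helper below is a sketch of how code could tolerate both. The name `vocabulary` is my own, and I have not checked which gensim calls the Log2Vec scripts actually use, so pinning 3.8.3 remains the simpler route.

```python
def vocabulary(keyed_vectors):
    """Return the vocabulary words of a gensim KeyedVectors-like object.

    gensim 4.x exposes `key_to_index`; gensim 3.x exposes `vocab`.
    """
    mapping = getattr(keyed_vectors, "key_to_index", None)
    if mapping is None:
        mapping = keyed_vectors.vocab
    return list(mapping)


# Stand-ins for the two API generations, for illustration only.
class OldStyleVectors:
    vocab = {"block": object(), "receiving": object()}


class NewStyleVectors:
    key_to_index = {"block": 0, "receiving": 1}


print(vocabulary(OldStyleVectors()), vocabulary(NewStyleVectors()))
```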
Test trace
To get familiar with the approach:
Code Block |
---|
(log2vec) marius@mleng:~/source/Log2Vec$ python log2vec.py -i results -t HDFS # no errors
(log2vec) marius@mleng:~/source/Log2Vec$ python code/preprocessing.py -rawlog ./code/data/BGL.log
rawlogs:./code/data/BGL.log
variables have been removed
logs without variables:./code/data/BGL_without_variables.log
(log2vec) marius@mleng:~/source/Log2Vec$ python code/get_syn_ant.py -logs ./code/data/BGL_without_variables.log -ant_file ./middle/ants.txt
input: ./code/data/BGL_without_variables.log
syn_file ./middle/syns.txt
ant_file ./middle/ants.txt
(log2vec) marius@mleng:~/source/Log2Vec$ python code/get_triplet.py data/BGL_without_variables.log middle/bgl_triplet.txt
(log2vec) marius@mleng:~/source/Log2Vec$
(log2vec) marius@mleng:~/source/Log2Vec$ python code/getTempLogs.py -input data/BGL_without_variables.log -output middle/BGL_without_variables_for_training.log
input: data/BGL_without_variables.log
output: middle/BGL_without_variables_for_training.log
(log2vec) marius@mleng:~/source/Log2Vec/code/LRWE/src$ ./lrcwe -train ../../../middle/BGL_without_variables_for_training.log
alpha:0.050000, alpha_syn:0.025000, alpha_ant:0.001000, alpha_rel:0.010000
belta_syn:0.700000, belta_ant:0.200000, belta_rel:0.800000
Starting training using file ../../../middle/BGL_without_variables_for_training.log
train_file: ../../../middle/BGL_without_variables_for_training.log
word_num:0
Vocab size: 1
Words in train file: 1
(log2vec) marius@mleng:~/source/Log2Vec/code/LRWE/src$ ./lrcwe -train /home/marius/source/Log2Vec/middle/BGL_without_variables_for_training.log -synonym /home/marius/source/Log2Vec/middle/syns.txt -antonym /home/marius/source/Log2Vec/middle/ants.txt -output /home/marius/source/Log2Vec/middle/bgl_words.model -save-vocab /home/marius/source/Log2Vec/middle/bgl.vocab -belta-rel 0.8 - alpha-rel 0.01 -alpha-ant 0.3 -size 32 -min-count 1 /home/marius/source/Log2Vec/middle/bgl_triplet.txt
alpha:0.050000, alpha_syn:0.025000, alpha_ant:0.300000, alpha_rel:0.010000
belta_syn:0.700000, belta_ant:0.200000, belta_rel:0.800000
Starting training using file /home/marius/source/Log2Vec/middle/BGL_without_variables_for_training.log
train_file: /home/marius/source/Log2Vec/middle/BGL_without_variables_for_training.log
word_num:0
Vocab size: 1
Words in train file: 1
synonyms file total line: 0, words: 0, ignore words: 407
antonyms file total line: 0, words: 0, ignore words: 10
(log2vec) marius@mleng:~/source/Log2Vec$ python code/mimick/make_dataset.py --vectors /home/marius/source/Log2Vec/middle/bgl_words.model --w2v-format --output /home/marius/source/Log2Vec/middle/bgl_words.pkl
Total in Embeddings vocabulary: 1
Training set character count: 4
(log2vec) marius@mleng:~/source/Log2Vec$ python code/mimick/model.py --dataset /home/marius/source/Log2Vec/middle/bgl_words.pkl --vocab code/mimick/testdir/testvocab.txt --output middle/oov.vector
[dynet] random seed: 1179517440
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
100% | | [lr=0.01 clips=0 updates=0] None
The dy.parameter(...) call is now DEPRECATED. | There is no longer need to explicitly add parameters to the computation graph. Any used parameter will be added automatically.
100% |############################################################################################|
100% | | [lr=0.01 clips=0 updates=0] None
100% |############################################################################################|
100% | | [lr=0.01 clips=0 updates=0] None
100% |############################################################################################|
100% | | [lr=0.01 clips=0 updates=0] None
100% |############################################################################################|
100% | | [lr=0.01 clips=0 updates=0] None
100% |############################################################################################|
100% | | [lr=0.01 clips=0 updates=0] None
100% |############################################################################################|
100% | | [lr=0.01 clips=0 updates=0] None
100% |############################################################################################|
100% | | [lr=0.01 clips=0 updates=0] None
100% |############################################################################################|
100% | | [lr=0.01 clips=0 updates=0] None
100% |############################################################################################|
100% | | [lr=0.01 clips=0 updates=0] None
100% |############################################################################################| |
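The trace reports `Vocab size: 1` and `Total in Embeddings vocabulary: 1`, which suggests the training input was effectively empty in this run. A small sanity check on the model file can catch that early. This sketch assumes the `lrcwe` output follows the usual word2vec layout, where the first line holds the vocabulary size and the vector dimension. The helper names and the `min_vocab` threshold are my own and arbitrary.

```python
from pathlib import Path


def read_w2v_header(path):
    """Return (vocab_size, dimensions) from the first line of a word2vec-format file."""
    with Path(path).open("rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace")
    vocab_size, dimensions = (int(field) for field in first_line.split())
    return vocab_size, dimensions


def looks_trained(path, min_vocab=10):
    """Heuristic: a model with a near-empty vocabulary probably saw no real input."""
    vocab_size, _ = read_w2v_header(path)
    return vocab_size >= min_vocab
```

For example, `looks_trained("middle/bgl_words.model")` could be called before running the mimick stage on the file.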
Very interesting. This appears to be a multi-stage and fairly advanced vectorization technique.
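To keep the stages straight, the sketch below lists each step of the trace with the files it reads and writes, and flags inputs that no earlier step produced. The paths are copied from the trace, with the absolute `middle/` paths shortened to be relative to the repository root. The function name `unproduced_inputs` and the set of files assumed to exist up front are my own. A flagged path is not necessarily an error, because the file may already exist on disk.

```python
import posixpath

# (stage, inputs, outputs) in the order they appear in the trace.
STAGES = [
    ("preprocessing", ["./code/data/BGL.log"],
     ["./code/data/BGL_without_variables.log"]),
    ("get_syn_ant", ["./code/data/BGL_without_variables.log"],
     ["./middle/syns.txt", "./middle/ants.txt"]),
    ("get_triplet", ["data/BGL_without_variables.log"],
     ["middle/bgl_triplet.txt"]),
    ("getTempLogs", ["data/BGL_without_variables.log"],
     ["middle/BGL_without_variables_for_training.log"]),
    ("lrcwe", ["middle/BGL_without_variables_for_training.log", "middle/syns.txt",
               "middle/ants.txt", "middle/bgl_triplet.txt"],
     ["middle/bgl_words.model", "middle/bgl.vocab"]),
    ("mimick/make_dataset", ["middle/bgl_words.model"], ["middle/bgl_words.pkl"]),
    ("mimick/model", ["middle/bgl_words.pkl", "code/mimick/testdir/testvocab.txt"],
     ["middle/oov.vector"]),
]

# Files assumed to exist before the first stage runs (an assumption, not verified).
INITIAL_FILES = ["code/data/BGL.log", "code/mimick/testdir/testvocab.txt"]


def unproduced_inputs(stages, initial_files):
    """List (stage, path) pairs whose input was not written by an earlier stage."""
    available = {posixpath.normpath(path) for path in initial_files}
    problems = []
    for name, inputs, outputs in stages:
        for path in inputs:
            if posixpath.normpath(path) not in available:
                problems.append((name, path))
        available.update(posixpath.normpath(path) for path in outputs)
    return problems


for stage, path in unproduced_inputs(STAGES, INITIAL_FILES):
    print(f"{stage}: {path} was not produced by an earlier stage")
```

In this trace, preprocessing writes to `code/data/` while the triplet and training-log steps read from `data/`, so the two directories are worth comparing before trusting the resulting vectors.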