Log2Vec conda Python 3.9

 

I am looking into Log2Vec as an alternative to TF-IDF for log vectorization. My main interest is consuming Sysmon logs later.

Log2Vec

 

Abstract: Logs are one of the most valuable data sources for large-scale service management. Log representation, which converts unstructured texts to structured vectors or matrices, serves as the first step towards automated log analysis. However, the current log representation methods neither represent domain-specific semantic information of logs, nor handle the out-of-vocabulary (OOV) words of new types of logs at runtime. We propose Log2Vec, a semantic-aware representation framework for log analysis. Log2Vec combines a log-specific word embedding method to accurately extract the semantic information of logs, with an OOV word processor to embed OOV words into vectors at runtime. We present an analysis on the impact of OOV words and evaluate the performance of the OOV word processor. The evaluation experiments on four public production log datasets demonstrate that Log2Vec not only fixes the issue presented by OOV words, but also significantly improves the performance of two popular log-based service management tasks, including log classification and anomaly detection. We have packaged Log2Vec into an open-source toolkit and hope that it can be used for future research.

 

https://github.com/NetManAIOps/Log2Vec

The work was supported by National Key R&D Program of China (Grant No. 2019YFB1802504, 2018YFB1800405), the National Natural Science Foundation of China (Grant Nos. 61772307, 61902200 and 61402257), the China Postdoctoral Science Foundation (2019M651015) and the Beijing National Research Center for Information Science and Technology (BNRist).

Paper

Our paper was published at the 29th International Conference on Computer Communications and Networks (ICCCN 2020). The information can be found here:

  • Weibin Meng, Ying Liu, Yuheng Huang, Shenglin Zhang, Federico Zaiter, Bingjin Chen, Dan Pei. A Semantic-aware Representation Framework for Online Log Analysis. ICCCN 2020. August 3 - August 6, 2020, Honolulu, Hawaii, USA.

 

Install Log2Vec on Linux Mint 21 AMD64 in 2024

I use separate conda environments for “older” research-grade software. Software moves at a rapid pace.

The installation requires a little software engineering skill.

 

The following dependencies need to be present:

1. nltk, nltk.download("wordnet")
2. spacy, spacy.load("en_core_web_md")
3. progressbar
4. dynet (python3)
  • gensim 3.x

  • A C++ compiler tool chain (I document my ML env below)
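Before going further, it can help to check which of the Python dependencies listed above are already importable in the active conda environment. The following is a minimal sketch, not part of Log2Vec. The helper name `missing_modules` is hypothetical, and the import names are assumed to match the package names. It does not verify that the WordNet corpus or the `en_core_web_md` model has been downloaded.

```python
import importlib.util

# Import names assumed for the dependencies listed above.
REQUIRED_MODULES = ["nltk", "spacy", "progressbar", "dynet", "gensim"]


def missing_modules(names):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed Python dependencies are importable.")
```

Running it inside the activated environment prints the modules that still need to be installed.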

dynet

The last release is from 2020 (state of this information: Jan 15, 2024). distutils was deprecated in Python 3.10 and removed in Python 3.12, which causes build errors.

I chose to use 3.9, but your requirements may be stricter. This way the error can be avoided.

conda create --name log2vec python=3.9
conda activate log2vec
pip install 'setuptools<57.0.0'
pip install --verbose dynet --no-build-isolation
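A small guard can make the interpreter constraint explicit before attempting the dynet build. This is a sketch under one assumption: that the missing standard-library distutils (removed in Python 3.12) is what breaks the build, so the check looks only at the interpreter version. The function name `stdlib_has_distutils` is hypothetical.

```python
import sys


def stdlib_has_distutils(version_info=sys.version_info):
    """distutils was removed from the standard library in Python 3.12 (PEP 632)."""
    return tuple(version_info[:2]) < (3, 12)


if __name__ == "__main__":
    if stdlib_has_distutils():
        print("distutils is still part of this interpreter's standard library.")
    else:
        print("Python >= 3.12: the dynet source build is likely to fail without distutils.")
```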

 

build-essential and cmake (Linux Mint 21)

This is straightforward, but I document the compiler version here for the sake of completeness.

apt install build-essential
apt install cmake

(log2vec) marius@mleng:~/source/Log2Vec/code/LRWE/src$ dpkg -l | grep build-essential
ii  build-essential  12.9ubuntu3  amd64  Informational list of build-essential packages
(log2vec) marius@mleng:~/source/Log2Vec/code/LRWE/src$ dpkg -l | grep cmake
ii  cmake  3.22.1-1ubuntu1.22.04.1  amd64  cross-platform, open-source make system
ii  cmake-data  3.22.1-1ubuntu1.22.04.1  all  CMake data files (modules, templates and documentation)

nltk and wordnet

spacy

Build the C++ project (make)

gensim 3.x

gensim 4.x introduced breaking API changes. Staying on version 3.x avoids the resulting errors.
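As an illustration of the kind of change involved, gensim 4.0 renamed the Word2Vec `size` parameter to `vector_size` and replaced `wv.vocab` with `wv.key_to_index`. Whether Log2Vec relies on exactly these calls is not checked here. The sketch below only verifies that the installed gensim belongs to the 3.x series. The helper names are hypothetical.

```python
from importlib import metadata


def major_version(version):
    """Extract the leading major number from a version string such as '3.8.3'."""
    return int(version.split(".")[0])


def gensim_is_3x(version=None):
    """Return True if the given (or installed) gensim version is in the 3.x series."""
    if version is None:
        try:
            version = metadata.version("gensim")
        except metadata.PackageNotFoundError:
            return False  # gensim is not installed in this environment
    return major_version(version) == 3


if __name__ == "__main__":
    print("gensim 3.x installed:", gensim_is_3x())
```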

Test trace

A first run to get familiar with the approach.

  • I ended up hard-coding some absolute paths because it was getting late. The test run succeeded.

 

 

Very interesting. This appears to be a multi-stage and very advanced vectorization technique.

conda env as YAML (Python 3.9, 16.1.2024)

This YAML file can be used to build a fully functional Log2Vec environment based on Python 3.9. The original release targeted Python 3.6. There will be some deprecation warnings, but I believe they can be safely ignored.

Code: gist.githubusercontent.com/norandom/a1fd048d7d870a90aa72c9c45fd44e02/raw/f8c6ad9c5470b5380d4bcea8eaa237dd64217f9d/conda_env_log2vec.yml

Wrapper for the Log2Vec libraries for automated log file vectorization

This wrapper makes it possible to use the Log2Vec library for automated log file vectorization, based on the semantic embedding and NLP approach demonstrated in the paper.

Code:

gist.githubusercontent.com/norandom/86a701a56b7de8c800a83eac293da813/raw/a9c7db1d46be633f344b4a07ff05d8985530b162/log2vec_wrapper.sh

Understanding the .vector versus the .log

The .vector format is line-based, with up to 32 vector dimensions per line.

A header line is added with the number of lines (samples) and the number of dimensions (32). The .vector file therefore has one more line than the .log file.

The vectors can be consumed by an ML pipeline.
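To show how such a file could be loaded into an ML pipeline, here is a minimal parsing sketch. It rests on assumptions not confirmed by the Log2Vec source: the header holds two whitespace-separated integers in the order samples, then dimensions; every following line holds only whitespace-separated floats; and rows with fewer values than the header dimension are padded with zeros. The function name `read_vector_file` is hypothetical.

```python
import io


def read_vector_file(handle, pad_value=0.0):
    """Parse the assumed layout: header 'samples dims', then one vector per line."""
    header = handle.readline().split()
    samples, dims = int(header[0]), int(header[1])
    vectors = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        values = [float(x) for x in fields]
        if len(values) > dims:
            raise ValueError(f"line has {len(values)} values, header says {dims}")
        # Assumption: shorter rows are padded so every row has equal length.
        values += [pad_value] * (dims - len(values))
        vectors.append(values)
    if len(vectors) != samples:
        raise ValueError(f"header says {samples} samples, found {len(vectors)}")
    return vectors


if __name__ == "__main__":
    # Tiny synthetic example with 4 dimensions instead of 32.
    demo = io.StringIO("2 4\n0.1 0.2 0.3 0.4\n0.5 0.6\n")
    for row in read_vector_file(demo):
        print(row)
```

For a real file, the handle would come from `open(path)`. The resulting list of equal-length rows can be passed to `numpy.array` or a similar container. If the real files prefix each line with a token or index, the float conversion needs adjusting.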