Elasticsearch.06 데이터 색인과 텍스트 분석 Part.2

데이터 색인과 텍스트 분석 Part.2

Elasticsearch.06 데이터 색인과 텍스트 분석 Part.1 - 역인덱스, 텍스트 분석, 애널라이저

Elasticsearch 의 역인덱스 과정은 Analyzer 애널라이저를 통해서 처리된다.

애널라이저는 다음과 같은 구성으로 이뤄지고, 각 역할에 대해 자세히 살펴보자.

Character Filter 캐릭터 필터
Tokenizer 토크나이저
Token Filter 토큰 필터
Stemming 형태소 분서

Elasticsearch 는 텍스트 분석을 하기 위한 Analyzer 애널라이저를 조합하고, 그 동작을 확인할 수 있는 API 와 기능을 제공한다.

_analyze API
term Query
Custom Analyzer
_termvectors API

`_analyze` API

_analyze API 를 활용하면 Analyzer 애널라이저의 동작을 확인할 수 있다.
_analyze API 요청에 입력된 토크나이저 와 토큰 필터 를 기반으로 텍스트 분석된 결과를 확인할 수 있다.
tokenizer 는 토크나이저로 1개만 입력, filter 는 토큰필터로 N개 입력 가능하다.
text 로 입력된 요청 문장은 token 으로 분리되어 응답 정보에 포함된다.

Request

GET _analyze
{
  "text": "The quick brown fox jumps over the lazy dog",
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    "stop",
    "snowball"
  ]
}

Response

{
  "tokens" : [
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "jump",
      "start_offset" : 20,
      "end_offset" : 25,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 26,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazi",
      "start_offset" : 35,
      "end_offset" : 39,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "dog",
      "start_offset" : 40,
      "end_offset" : 43,
      "type" : "word",
      "position" : 8
    }
  ]
}

등록된 `Analyzer` 직접 활용한 `_analyze` API

위와 같이 토크나이저 와 토큰필터 를 직접 입력 지정하여 할 수도 있지만, Elasticsearch 에서 이미 등록된 애널라이저 를 이용하면 바로 사용 가능하다.

Request

GET _analyze
{
  "text": "The quick brown fox jumps over the lazy dog",
  "analyzer": "snowball"
}

Response

{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    // ...
    {
      "token" : "dog",
      "start_offset" : 40,
      "end_offset" : 43,
      "type" : "word",
      "position" : 8
    }
  ]
}

`Index` 인덱스에 `Analyzer`적용

인덱스 에도 mappings 설정을 통해 애널라이저 를 적용할 수 있다.
analyzer 요청 정보에 사용하고자 하는 애널라이저 를 지정한다.
애널라이저 가 등록된 인덱스 에 데이터를 저장하면 텍스트 분석 과정을 거치게 된다.
그리고 애널라이저 기반의 생성된 Term 텀 기준으로 데이터 검색이 가능하다.
- jumps 또는 jumping 단어 모두 jump 로 텍스트 분석되기 때문에, 데이터 검색 시 jumps, jumping, jump 로 요청하여도 동일한 결과를 확인할 수 있다.

인덱스에 애널라이저 등록하는 방법

PUT my_index2
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "snowball"
      }
    }
  }
}

GET my_index2/_search
{
  "query": {
    "match": {
      "message": "jumping"    // `jumps`, `jumping`, `jump` 입력하여 검색한 결과 모두 동일하다.
    }
  }
}

`term` Query

term Query 질의 방식을 사용하면, match Query 질의 방식과 다르게 애널라이저 과정이 적용되지 않는다.
오로지 저장된 Term 텀 을 기준으로 검색하게 된다.

GET my_index2/_search
{
  "query": {
    "term": {
      "message": "jump"   // `jumps`, `jumping` 인 경우, 검색되지 않는다.
    }
  }
}

Custom Analyzer

Index 인덱스 에 Analyzer 애널라이저 를 지정할 때, 토크나이저 와 토큰필터 를 조합하여 등록하는 경우가 많다.
애널라이저 만 Custom 하는 것 뿐 아니라, 토큰필터 도 Custom 하게 지정 가능하다.

`my_index` 인덱스에 `my_custom_analyzer` 애널라이저 등록

PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_custom_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "stop",
              "snowball"
            ]
          }
        }
      }
    }
  }
}

`my_custom_analyzer` 애널라이저에 `my_stop_filter` 토큰 필터 적용

PUT my_index3
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_custom_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "my_stop_filter",
              "snowball"
            ]
          }
        },
        "filter": {
          "my_stop_filter": {
            "type": "stop",
            "stopwords": [
              "brown"
            ]
          }
        }
      }
    }
  }
}

Custom 애널라이저를 바로 인덱스에 적용

settings.mappings.message 필드에 지정하여 적용 가능하다.

PUT my_index3
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_custom_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "my_stop_filter",
              "snowball"
            ]
          }
        },
        "filter": {
          "my_stop_filter": {
            "type": "stop",
            "stopwords": [
              "brown"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

인덱스 의 설정 정보를 변경하고자 할 때는 삭제 후 다시 생성해야 한다.

`_termvectors` API

색인된 문서의 역인덱스된 내용을 확인할 때, _termvectors API 를 활용하면 된다.

Request

GET my_index3/_termvectors/1?fields=message

Response

{
  "_index" : "my_index3",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 1,
  "term_vectors" : {
    "message" : {
      "field_statistics" : {
        "sum_doc_freq" : 7,
        "doc_count" : 1,
        "sum_ttf" : 8
      },
      "terms" : {
        "dog" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 8,
              "start_offset" : 40,
              "end_offset" : 43
            }
          ]
        },
        // ...
        "the" : {
          "term_freq" : 2,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 3
            },
            {
              "position" : 6,
              "start_offset" : 31,
              "end_offset" : 34
            }
          ]
        }
      }
    }
  }
}

지금까지 Elasticsearch 의 핵심이라 할 수 있는 Analyzer 애널라이저 동작 또는 구성, 설정을 확인하는 방법 등을 살펴보았다.

출처

김종민님 - Elastic 가이드북
- Elastic 회사의 개발자이셨던 김종민님의 Elastic 가이드북 을 주로 참고하여 문서 작성 계획
Elasticsearch in Action

Share this article:

Elasticsearch.06 데이터 색인과 텍스트 분석 Part.2 - Analyzer 분석

데이터 색인과 텍스트 분석 Part.2

`_analyze` API

Request

Response

등록된 `Analyzer` 직접 활용한 `_analyze` API

Request

Response

`Index` 인덱스에 `Analyzer`적용

인덱스에 애널라이저 등록하는 방법

`term` Query

Custom Analyzer

`my_index` 인덱스에 `my_custom_analyzer` 애널라이저 등록

`my_custom_analyzer` 애널라이저에 `my_stop_filter` 토큰 필터 적용

Custom 애널라이저를 바로 인덱스에 적용

`_termvectors` API

Request

Response

출처

Elasticsearch.06 데이터 색인과 텍스트 분석 Part.1

AWS Webina - 금융사의 클라우드 도입

Elasticsearch.06 데이터 색인과 텍스트 분석 Part.2 - Analyzer 분석

데이터 색인과 텍스트 분석 Part.2

_analyze API

Request

Response

등록된 Analyzer 직접 활용한 _analyze API

Request

Response

Index 인덱스에 Analyzer적용

인덱스에 애널라이저 등록하는 방법

term Query

Custom Analyzer

my_index 인덱스에 my_custom_analyzer 애널라이저 등록

my_custom_analyzer 애널라이저에 my_stop_filter 토큰 필터 적용

Custom 애널라이저를 바로 인덱스에 적용

_termvectors API

Request

Response

출처

Elasticsearch.06 데이터 색인과 텍스트 분석 Part.1

AWS Webina - 금융사의 클라우드 도입

`_analyze` API

등록된 `Analyzer` 직접 활용한 `_analyze` API

`Index` 인덱스에 `Analyzer`적용

`term` Query

`my_index` 인덱스에 `my_custom_analyzer` 애널라이저 등록

`my_custom_analyzer` 애널라이저에 `my_stop_filter` 토큰 필터 적용

`_termvectors` API