Sentence-Level Natural Language Preprocessing (Sentence Tokenization, POS Tagging, Lemmatization)

Marlangcow 2024. 10. 3. 21:05

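Sentence Tokenization

Sometimes a corpus has to be tokenized into sentences first, so that the analysis can preserve the meaning of each sentence.

POS tagging is the typical case. The part of speech of a word depends not only on the word itself but also on where it is used within a sentence. The sentence boundaries therefore have to be known before the parts of speech are decided, which means tokenizing into sentences first and tagging afterwards.

Example - sent_tokenize()

With the punkt module installed, sentence tokenization takes linguistic cues such as periods and abbreviations (Mr., Dr.) into account. Instead of simply cutting wherever a period appears, it separates the actual sentences well.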
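To see why cutting at every period is not enough, here is a minimal sketch using only the standard library. It is not how punkt works internally: the `ABBREVIATIONS` set and both helper functions are hypothetical, hand-written stand-ins meant only to show the kind of mistake a naive splitter makes.

```python
import re

# Hypothetical, hand-written abbreviation list (illustration only).
ABBREVIATIONS = {"mr.", "mrs.", "dr."}

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [piece for piece in re.split(r"(?<=[.!?])\s+", text.strip()) if piece]

def abbreviation_aware_split(text):
    """Glue naive pieces back together when a piece ends with a known abbreviation."""
    sentences = []
    buffer = ""
    for piece in naive_split(text):
        buffer = f"{buffer} {piece}".strip()
        last_word = buffer.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # this period belongs to an abbreviation, keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

sample = "Dr. Smith met Mr. Jones. They talked for an hour."
print(naive_split(sample))               # four fragments, two of them broken
print(abbreviation_aware_split(sample))  # two complete sentences
```

A fixed abbreviation list like this breaks down quickly on real text, which is the reason to rely on a trained tokenizer such as sent_tokenize() instead.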

# Import the required packages and functions
import nltk
nltk.download('punkt')  # recent NLTK releases (3.9+) may need 'punkt_tab' instead
from nltk.tokenize import sent_tokenize
# from text import TEXT

TEXT = """Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.
In another moment down went Alice after it, never once considering how in the world she was to get out again.
The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well.
Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs. She took down a jar from one of the shelves as she passed; it was labelled 'ORANGE MARMALADE', but to her great disappointment it was empty: she did not like to drop the jar for fear of killing somebody, so managed to put it into one of the cupboards as she fell past it.
"""

corpus = TEXT

tokenized_sents = sent_tokenize(corpus)

# Test code
tokenized_sents

Part-of-Speech (POS) Tagging

To understand what a word means in a sentence, you need to know which part of speech it is used as. Marking each word with the part of speech it plays is called Part-of-Speech (POS) tagging.

A part of speech is determined by how the word is used in its sentence. To tag a corpus made up of several sentences, first split the corpus into sentences, then tokenize each sentence into words, and finally tag each word token.

Penn Treebank POS Tags

NLTK's pos_tag() function tags words according to the Penn Treebank POS Tags. This tag set breaks the parts of speech of English corpora down into fine-grained categories and assigns a short tag to each one.

For example, a result of ('After', 'IN') means that After was tagged as a preposition, and ('The', 'DT') means that The was tagged as a determiner.
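As a quick reference, a few common Penn Treebank tags can be kept in a lookup table. The sketch below is a small, deliberately incomplete subset; `PENN_TAGS` and `describe()` are names made up for this illustration and are not part of NLTK.

```python
# A small, non-exhaustive subset of Penn Treebank tags (hypothetical helper table).
PENN_TAGS = {
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "RB": "adverb",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
}

def describe(tagged_words):
    """Attach a readable description to each (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, "not in this subset")) for word, tag in tagged_words]

for row in describe([("After", "IN"), ("The", "DT")]):
    print(row)
```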

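Example - POS tagging

Complete the pos_tagger() function for POS tagging.

pos_tagger() takes a sentence-tokenized corpus as its parameter.
It returns a flat, one-dimensional list [(word, tag), (word, tag), ...] in which sentence boundaries are no longer kept.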

import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
# from text import TEXT  # in the exercise environment TEXT comes from text.py; here it is the excerpt defined above
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

corpus = TEXT

# Sentence tokenization
tokenized_sents = sent_tokenize(corpus)

def pos_tagger(tokenized_sents):
    pos_tagged_words = []

    for sentence in tokenized_sents:
        # Word tokenization
        tokenized_words = word_tokenize(sentence)

        # POS tagging
        pos_tagged = pos_tag(tokenized_words)

        # extend() keeps adding each sentence's tagged tokens to the flat pos_tagged_words list
        pos_tagged_words.extend(pos_tagged)

    return pos_tagged_words

# Test code
pos_tagger(tokenized_sents)
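The choice of extend() over append() in pos_tagger() is what produces the flat list. A minimal sketch of the difference, using hand-written (word, tag) pairs in place of real pos_tag() output so that no NLTK data is needed:

```python
# Hand-written (word, tag) pairs standing in for pos_tag() output (assumed values).
tagged_sentences = [
    [("Alice", "NNP"), ("ran", "VBD"), (".", ".")],
    [("She", "PRP"), ("fell", "VBD"), (".", ".")],
]

flat, nested = [], []
for tagged in tagged_sentences:
    flat.extend(tagged)    # adds each (word, tag) pair individually
    nested.append(tagged)  # adds the whole sentence as a single element

print(len(flat))    # number of tagged tokens across all sentences
print(len(nested))  # number of sentences
```

If a later step needs the sentence boundaries again, the nested form built with append() is the one to keep.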

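Lemmatization

A lemma is the dictionary (base) form of a word. Different words can share the same lemma, so merging words by lemma normalizes them. For example, am, are, and is are different words, but they all have the lemma be. English corpora contain a great many forms of be, so merging them into their lemma can reduce the vocabulary size considerably.

Before lemmatizing, the tokenized words have to be POS tagged.

pos_tag() uses Penn Treebank POS tags, but the function used for lemmatization expects WordNet POS tags. The tags produced by pos_tag() therefore have to be converted to WordNet POS tags.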
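The effect on vocabulary size can be sketched with a toy example. `TOY_LEMMAS` below is a hypothetical hand-written table used only for illustration; a real lemmatizer looks words up in a dictionary such as WordNet.

```python
# Hypothetical lemma table (illustration only, not a real lemmatizer).
TOY_LEMMAS = {"am": "be", "are": "be", "is": "be", "was": "be", "were": "be"}

tokens = ["i", "am", "here", "you", "are", "here", "she", "is", "here"]

# Replace each token by its lemma when the table knows it, otherwise keep the token.
lemmas = [TOY_LEMMAS.get(token, token) for token in tokens]

print("unique tokens:", len(set(tokens)))
print("unique lemmas:", len(set(lemmas)))
```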

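WordNet POS Tag

WordNet POS tags are the part-of-speech tags used in WordNet, a large lexical database of English. The conversion here targets four of them: noun, verb, adjective, and adverb.

Looking back at the Penn Treebank POS Tags, every tag that starts with N, such as NN, NNP, and NNPS, denotes a noun. Every tag that starts with J, such as JJ, JJR, and JJS, denotes an adjective. Because of this, a Penn Treebank POS tag can easily be mapped to one of the four WordNet POS tags by its first letter.

Now let's use penn_to_wn() to convert the tags to WordNet POS tags and then lemmatize. Lemmatization uses the lemmatize() method of NLTK's WordNetLemmatizer class. Passing a word and its WordNet POS tag as lemmatize(word, pos) returns the lemma.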

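The Penn Treebank tag set has more parts of speech than the WordNet tag set. As a result, some tags cannot be converted and wn_tag ends up empty for them, which raises the question of how to lemmatize those words.

1) Words whose part of speech is not covered by the WordNet POS tags can simply be removed,

2) lemmatizer.lemmatize(word) can be called directly without any POS information,

3) or the word can be used as-is, without lemmatizing it.

Any of the three is acceptable. Choose whichever gives better results for the purpose of the analysis and the characteristics of the corpus. The code below keeps words with a part of speech outside the WordNet POS tags in their original form.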
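The three options can be put side by side in a minimal sketch. To keep it runnable without WordNet data, `toy_penn_to_wn()`, `toy_lemmatize()`, and the `TOY_LEMMAS` table are hypothetical stand-ins, and the `strategy` parameter is a name invented for this example. The only NLTK behavior assumed is that lemmatize() treats a word as a noun when no POS is given.

```python
# Hypothetical stand-ins so the sketch runs without NLTK or WordNet data.
FIRST_LETTER_TO_WN = {"J": "a", "N": "n", "V": "v", "R": "r"}
# Toy entries; ("us", "n") shows how treating a non-noun as a noun can change it unexpectedly.
TOY_LEMMAS = {("rabbits", "n"): "rabbit", ("was", "v"): "be", ("us", "n"): "u"}

def toy_penn_to_wn(tag):
    """Map a Penn Treebank tag to a WordNet-style tag by its first letter, else None."""
    return FIRST_LETTER_TO_WN.get(tag[:1])

def toy_lemmatize(word, pos=None):
    """Look the word up in the toy table, defaulting to noun when no POS is given."""
    return TOY_LEMMAS.get((word, pos or "n"), word)

def lemmatize_with_fallback(tagged_words, strategy="keep"):
    """strategy: 'drop' (option 1), 'default' (option 2), or 'keep' (option 3)."""
    result = []
    for word, tag in tagged_words:
        wn_tag = toy_penn_to_wn(tag)
        if wn_tag is not None:
            result.append(toy_lemmatize(word, wn_tag))
        elif strategy == "drop":
            continue                             # 1) remove the word
        elif strategy == "default":
            result.append(toy_lemmatize(word))   # 2) lemmatize without POS information
        elif strategy == "keep":
            result.append(word)                  # 3) keep the original word
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return result

tagged = [("rabbits", "NNS"), ("was", "VBD"), ("the", "DT"), ("us", "PRP")]
for strategy in ("drop", "default", "keep"):
    print(strategy, lemmatize_with_fallback(tagged, strategy))
```

Option 2 is the one most likely to surprise, because function words get lemmatized as if they were nouns.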

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
nltk.download('wordnet')
nltk.download('omw-1.4')

# Convert a Penn Treebank POS tag to a WordNet POS tag
def penn_to_wn(tag):
  if tag.startswith('J'):
    return wn.ADJ
  elif tag.startswith('N'):
    return wn.NOUN
  elif tag.startswith('V'):
    return wn.VERB
  elif tag.startswith('R'):
    return wn.ADV
  else:
    return None

# Assumes pos_tagger() and tokenized_sents from the POS tagging example above
tagged_words = pos_tagger(tokenized_sents)
tokenized_words = [word for word, tag in tagged_words]

lemmatizer = WordNetLemmatizer()
lemmatized_words = []

for word, tag in tagged_words:
  # Convert to a WordNet POS tag
  wn_tag = penn_to_wn(tag)

  # Lemmatize using the POS tag: lemmatize(word, pos) returns the lemma
  if wn_tag in (wn.NOUN, wn.ADJ, wn.ADV, wn.VERB):
    lemmatized_words.append(lemmatizer.lemmatize(word, wn_tag))
  else:
    # No matching WordNet tag: keep the word as-is
    lemmatized_words.append(word)

# Compare before and after lemmatization
print('Before lemmatization:', tokenized_words)
print('After lemmatization :', lemmatized_words)

Example - words_lemmatizer() function

Write the lemmatization function words_lemmatizer().

words_lemmatizer() takes a POS-tagged list of (word, tag) pairs as its parameter and returns the lemmatized result.
The function penn_to_wn(), which converts Penn Treebank POS tags to WordNet POS tags, is already provided. It can be imported from preprocess.py.
When a tag does not match any WordNet POS tag, add the original, unlemmatized word to the result.
Use penn_to_wn() to convert each tag to a WordNet POS tag.

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tag import pos_tag  # needed by pos_tagger() below
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
# from preprocess import pos_tagger
# from preprocess import penn_to_wn
# from text import TEXT

# Download the NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')  # added
nltk.download('omw-1.4')  # added

# Convert a Penn Treebank POS tag to a WordNet POS tag
def penn_to_wn(tag):
  if tag.startswith('J'):
    return wn.ADJ
  elif tag.startswith('N'):
    return wn.NOUN
  elif tag.startswith('V'):
    return wn.VERB
  elif tag.startswith('R'):
    return wn.ADV
  else:
    return None

# POS tagging function
def pos_tagger(tokenized_sents):
    pos_tagged_words = []

    for sentence in tokenized_sents:
        # Word tokenization
        tokenized_words = word_tokenize(sentence)

        # POS tagging
        pos_tagged = pos_tag(tokenized_words)
        pos_tagged_words.extend(pos_tagged)

    return pos_tagged_words

TEXT = """Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.
In another moment down went Alice after it, never once considering how in the world she was to get out again.
The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well.
Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs. She took down a jar from one of the shelves as she passed; it was labelled 'ORANGE MARMALADE', but to her great disappointment it was empty: she did not like to drop the jar for fear of killing somebody, so managed to put it into one of the cupboards as she fell past it.
"""

corpus = TEXT
tokenized_sents = sent_tokenize(corpus)
pos_tagged_words = pos_tagger(tokenized_sents)

lemmatizer = WordNetLemmatizer()

# Lemmatization function
def words_lemmatizer(pos_tagged_words):
    lemmatized_words = []

    for word, tag in pos_tagged_words:
        # Convert the Penn Treebank tag to a WordNet tag
        wn_tag = penn_to_wn(tag)

        if wn_tag in (wn.NOUN, wn.ADJ, wn.ADV, wn.VERB):
            lemmatized_words.append(lemmatizer.lemmatize(word, wn_tag))
        else:
            # No matching WordNet tag: keep the original word
            lemmatized_words.append(word)

    return lemmatized_words

# Test code
words_lemmatizer(pos_tagged_words)

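Applying Natural Language Preprocessing

Here everything covered so far is applied in actual code. In practice, though, the preprocessing steps have to be repeated until the corpus is of good enough quality for the analysis.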

import nltk
import pandas as pd
from collections import Counter
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')


# Cleaning by frequency
def clean_by_freq(tokenized_words, cut_off_count):
    # Count word frequencies with Counter to build the vocabulary
    vocab = Counter(tokenized_words)
    
    # Collect the set of words whose frequency is cut_off_count or lower
    uncommon_words = {key for key, value in vocab.items() if value <= cut_off_count}
    
    # Build the list of words that are not in uncommon_words
    cleaned_words = [word for word in tokenized_words if word not in uncommon_words]

    return cleaned_words


# Cleaning by word length
def clean_by_len(tokenized_words, cut_off_length):
    cleaned_by_freq_len = []
    
    for word in tokenized_words:
        if len(word) > cut_off_length:
            cleaned_by_freq_len.append(word)

    return cleaned_by_freq_len
    
    
# Stopword removal
def clean_by_stopwords(tokenized_words, stop_words_set):
    cleaned_words = []
    
    for word in tokenized_words:
        if word not in stop_words_set:
            cleaned_words.append(word)
            
    return cleaned_words


# Stemming with the Porter stemmer
def stemming_by_porter(tokenized_words):
    porter_stemmer = PorterStemmer()
    porter_stemmed_words = []

    for word in tokenized_words:
        stem = porter_stemmer.stem(word)
        porter_stemmed_words.append(stem)

    return porter_stemmed_words


# POS tagging function
def pos_tagger(tokenized_sents):
    pos_tagged_words = []

    for sentence in tokenized_sents:
        # Word tokenization
        tokenized_words = word_tokenize(sentence)
    
        # POS tagging
        pos_tagged = pos_tag(tokenized_words)
        pos_tagged_words.extend(pos_tagged)
    
    return pos_tagged_words


# Convert a Penn Treebank POS tag to a WordNet POS tag
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    else:
        # No matching WordNet tag
        return None


# Lemmatization function
def words_lemmatizer(pos_tagged_words):
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = []

    for word, tag in pos_tagged_words:
        wn_tag = penn_to_wn(tag)

        if wn_tag in (wn.NOUN, wn.ADJ, wn.ADV, wn.VERB):
            lemmatized_words.append(lemmatizer.lemmatize(word, wn_tag))
        else:
            lemmatized_words.append(word)

    return lemmatized_words


# Load the data
df = pd.read_csv('/data/imdb.tsv', delimiter='\t')

# Normalize case
df['review'] = df['review'].str.lower()

# Sentence tokenization
df['sent_tokens'] = df['review'].apply(sent_tokenize)

# POS tagging
df['pos_tagged_tokens'] = df['sent_tokens'].apply(pos_tagger)

# Lemmatization
df['lemmatized_tokens'] = df['pos_tagged_tokens'].apply(words_lemmatizer)

# Additional cleaning
stopwords_set = set(stopwords.words('english'))

df['cleaned_tokens'] = df['lemmatized_tokens'].apply(lambda x: clean_by_freq(x, 1))
df['cleaned_tokens'] = df['cleaned_tokens'].apply(lambda x: clean_by_len(x, 2))
df['cleaned_tokens'] = df['cleaned_tokens'].apply(lambda x: clean_by_stopwords(x, stopwords_set))

df[['cleaned_tokens']]

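For example, ehle, which appears in the final result, is a word whose meaning is unclear. In that case, first look up the parts of the corpus that contain it and check the context in which it is used. If it turns out to carry little meaning for the analysis, it is better to remove it.

To remove it, add ehle to the stopword set stopwords_set so that it is treated as a stopword. Any other unwanted word that shows up in the result can be handled the same way.

Also, n't expresses negation, so if you want to keep it in the analysis you can normalize it to not. Preprocessing is not finished after a single pass. Look at the result and repeat the preprocessing until the output is good enough.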
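Both adjustments can be sketched on a toy token list. The stopword set and tokens below are made-up stand-ins for stopwords_set and a row of cleaned_tokens. One thing worth checking in a real run: NLTK's English stopword list contains not, so it has to be taken out of the set if the normalized token should survive stopword removal.

```python
# Toy stand-ins for stopwords_set and one row of tokens (assumed values).
stopwords_set = {"the", "a", "is", "not"}
tokens = ["the", "movie", "is", "n't", "good", "ehle"]

# Normalize the contraction "n't" to "not" before removing stopwords.
normalized = ["not" if token == "n't" else token for token in tokens]

# Add the unwanted word as a custom stopword and keep "not" out of the set.
custom_stopwords = (stopwords_set | {"ehle"}) - {"not"}

cleaned = [token for token in normalized if token not in custom_stopwords]
print(cleaned)
```

Building a new set instead of mutating stopwords_set in place makes it easy to compare results with and without the custom entries.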

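Recombining After Preprocessing

Many recent NLP packages provide everything in one place, from preprocessing through model application to post-processing. When using such packages, you may need to pass the corpus in its original, untokenized form rather than as a list of word tokens.

So in some cases the preprocessed tokens have to be joined back into a single corpus.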

df[['cleaned_tokens']]

def combine(sentence):
    return ' '.join(sentence)

df['combined_corpus'] = df['cleaned_tokens'].apply(combine)

df[['combined_corpus']]
