
Word-Level Natural Language Preprocessing (Cleaning, Stopwords, Normalization, Stemming)


What Is Natural Language Preprocessing?

When working with natural language data, the preprocessing you apply can dramatically change the results of your analysis.

For example, take the sentence 'Oh, Hi helo. Nice to meetyou.' After correcting the spelling and spacing, removing 'Oh' (which contributes little to the meaning), collapsing the duplicated synonyms (Hi, Hello), and assigning an integer index to each word, you end up with something like {'Hi': 0, 'Nice': 1, 'to': 2, 'meet': 3, 'you': 4}, a form that is much easier to analyze. This whole process is called natural language preprocessing.

 

Steps of Natural Language Preprocessing

  • Tokenization: split the natural language data into small units (tokens) for analysis.
  • Cleaning: remove data that carries little meaning for the analysis.
  • Normalization: merge words that differ in form but share the same meaning.
  • Integer encoding: assign integer indices to the tokens so they are easier for a computer to handle (a minimal sketch follows this list).
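Integer encoding, the last step above, is not revisited in the code later in this post, so here is a minimal sketch of one common approach, assuming a cleaned token list: count word frequencies and give lower indices to more frequent words. The variable names (tokens, word_to_index) are illustrative only.

from collections import Counter

# Cleaned word tokens (illustrative example)
tokens = ['hi', 'nice', 'to', 'meet', 'you', 'nice', 'to', 'meet', 'you']

# Count frequencies and assign indices in descending order of frequency
vocab = Counter(tokens)
word_to_index = {word: idx for idx, (word, _) in enumerate(vocab.most_common())}

print(word_to_index)  # {'nice': 0, 'to': 1, 'meet': 2, 'you': 3, 'hi': 4}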

There is no fixed standard for natural language preprocessing. The steps you need, and even the order in which you apply them, depend on the goal of the analysis and the characteristics of the natural language data, and the desired shape of the final output also differs from case to case. The basic preprocessing techniques below should therefore be applied flexibly, adapted to the situation.

Word Tokenization

Natural language data gathered for analysis is called a corpus (in Korean, 말뭉치).

To use a corpus for analysis, it first has to be split into meaningful small units. Each of these 'meaningful small units' is called a token, and the process of dividing a corpus into tokens is called tokenization.

Word tokenization example

# Install NLTK (for English NLP); run this in a shell, not inside Python
# conda install nltk
import nltk

# Import the required packages and functions
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # download punkt: tokenizer models that handle sentence-final periods and abbreviations (Mr., Dr.)

# Text to tokenize
TEXT = """Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.
In another moment down went Alice after it, never once considering how in the world she was to get out again.
The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well.
Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs. She took down a jar from one of the shelves as she passed; it was labelled 'ORANGE MARMALADE', but to her great disappointment it was empty: she did not like to drop the jar for fear of killing somebody, so managed to put it into one of the cupboards as she fell past it.
"""

corpus = TEXT

# Word tokenization
tokenized_words = word_tokenize(corpus)  # word_tokenize(): takes a corpus and returns the list of word tokens

print(tokenized_words)

 

You can see that word_tokenize() treats apostrophes (') and commas (,) as token boundaries, but does not split on hyphens (-).

 

* See other word tokenization functions (link)
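For comparison, here is a minimal sketch of another NLTK tokenizer, WordPunctTokenizer, which splits on every punctuation mark, including apostrophes and hyphens; the sample sentence is made up for illustration.

from nltk.tokenize import word_tokenize, WordPunctTokenizer

sample = "Alice's sister wasn't reading a picture-less book."

print(word_tokenize(sample))
# e.g. ['Alice', "'s", 'sister', 'was', "n't", 'reading', 'a', 'picture-less', 'book', '.']
print(WordPunctTokenizer().tokenize(sample))
# e.g. ['Alice', "'", 's', 'sister', 'wasn', "'", 't', 'reading', 'a', 'picture', '-', 'less', 'book', '.']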

Cleaning

A corpus also contains words that carry no meaning or do not fit the goal of the analysis. Removing such words during preprocessing is called cleaning.

Low-Frequency Words

import nltk
from nltk.tokenize import word_tokenize
from text import TEXT
nltk.download('punkt')

corpus = TEXT
print(corpus)

# Output (the contents of TEXT):
# TEXT = """After reading the comments for this movie, I am not sure whether I should be angry, sad or sickened. Seeing comments typical of people who a)know absolutely nothing about the military or b)who base everything they think they know on movies like this or on CNN reports about Abu-Gharib makes me wonder about the state of intellectual stimulation in the world. At the time I type this the number of people in the US military: 1.4 million on Active Duty with another almost 900,000 in the Guard and Reserves for a total of roughly 2.3 million. The number of people indicted for abuses at at Abu-Gharib: Currently less than 20 That makes the total of people indicted .00083% of the total military. Even if you indict every single military member that ever stepped in to Abu-Gharib, you would not come close to making that a whole number.  The flaws in this movie would take YEARS to cover. I understand that it's supposed to be sarcastic, but in reality, the writer and director are trying to make commentary about the state of the military without an enemy to fight. In reality, the US military has been at its busiest when there are not conflicts going on. The military is the first called for disaster relief and humanitarian aid missions. When the tsunami hit Indonesia, devestating the region, the US military was the first on the scene. When the chaos of the situation overwhelmed the local governments, it was military leadership who looked at their people, the same people this movie mocks, and said make it happen. Within hours, food aid was reaching isolated villages. Within days, airfields were built, cargo aircraft started landing and a food distribution system was up and running. Hours and days, not weeks and months. Yes there are unscrupulous people in the US military. But then, there are in every walk of life, every occupation. But to see people on this website decide that 2.3 million men and women are all criminal, with nothing on their minds but thoughts of destruction or mayhem is an absolute disservice to the things that they do every day. One person on this website even went so far as to say that military members are in it for personal gain. Wow! Entry level personnel make just under $8.00 an hour assuming a 40 hour work week. Of course, many work much more than 40 hours a week and those in harm's way typically put in 16-18 hour days for months on end. That makes the pay well under minimum wage. So much for personal gain. I beg you, please make yourself familiar with the world around you. Go to a nearby base, get a visitor pass and meet some of the men and women you are so quick to disparage. You would be surprised. The military no longer accepts people in lieu of prison time. They require a minimum of a GED and prefer a high school diploma. The middle ranks are expected to get a minimum of undergraduate degrees and the upper ranks are encouraged to get advanced degrees."""

# Counter() from the collections module counts word frequencies
from collections import Counter

# Full list of word tokens
tokenized_words = word_tokenize(corpus)

# Build a vocabulary by counting word frequencies with Counter
# Counter() takes a list of tokens and returns a mapping of {word: count}
vocab = Counter(tokenized_words)
print(vocab)

 

# Extract the words that appear 2 times or fewer
uncommon_words = [key for key, value in vocab.items() if value <= 2]

print('Number of words with frequency <= 2:', len(uncommon_words))

# Keep only the tokens that are not in the low-frequency list
cleaned_by_freq = [word for word in tokenized_words if word not in uncommon_words]

print('Number of tokens with frequency >= 3:', len(cleaned_by_freq))

 

How rare a word has to be before it is cleaned out depends on the characteristics of the corpus and the goal of the analysis. The cut-off is a number you set yourself, choosing whatever seems most appropriate. Since it is hard to find the best cut-off in one shot, the usual approach is to preprocess the same corpus several times and keep the cut-off that gives the best result. This kind of iteration, repeating the preprocessing until you find the threshold that works best, applies equally to every preprocessing step.
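As a rough sketch of that iterative search (assuming the vocab and tokenized_words variables from the code above; the loop itself is not from the original post), the snippet below compares how many tokens survive a few candidate cut-off values.

# Compare how many tokens remain for several candidate frequency cut-offs
for cut_off in [1, 2, 3, 5]:
    uncommon = {word for word, count in vocab.items() if count <= cut_off}
    remaining = [word for word in tokenized_words if word not in uncommon]
    print(f'cut-off {cut_off}: {len(remaining)} tokens remain')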

Short Words

In English, words made up of only one or two letters are usually not important for the meaning of the corpus, so it is generally safe to remove them.

# Remove words with length <= 2
cleaned_by_freq_len = []

for word in cleaned_by_freq:
    if len(word) > 2:
        cleaned_by_freq_len.append(word)

# Check the result
print('Before cleaning:', cleaned_by_freq[:10])
print('After cleaning:', cleaned_by_freq_len[:10])

Tokens with little meaning, such as the comma, I, and be, have been removed as expected.

Example - Frequency and Length Cleaning Functions


Write two functions, clean_by_freq() and clean_by_len(), that take a frequency threshold and a word-length threshold respectively and remove the tokens at or below those thresholds.

clean_by_freq() takes the word-tokenized corpus (tokenized_words) and the frequency cut-off (cut_off_count) as parameters.
clean_by_len() takes the word-tokenized corpus (tokenized_words) and the word-length cut-off (cut_off_length) as parameters.
Both functions return the list of word tokens that remain after cleaning.
Also complete the driver code that checks the functions actually work: call clean_by_freq() with the word-tokenized list and a cut_off_count of 2, then call clean_by_len() with the result of clean_by_freq() and a cut_off_length of 2.

import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
# from text import TEXT
nltk.download('punkt')

TEXT = """Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.
In another moment down went Alice after it, never once considering how in the world she was to get out again.
The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well.
Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs. She took down a jar from one of the shelves as she passed; it was labelled 'ORANGE MARMALADE', but to her great disappointment it was empty: she did not like to drop the jar for fear of killing somebody, so managed to put it into one of the cupboards as she fell past it.
"""

corpus = TEXT
tokenized_words = word_tokenize(corpus)

def clean_by_freq(tokenized_words, cut_off_count):
    vocab = Counter(tokenized_words)

    # Remove words whose frequency is cut_off_count or lower
    uncommon_words = {key for key, value in vocab.items() if value <= cut_off_count}
    cleaned_words = [word for word in tokenized_words if word not in uncommon_words]

    return cleaned_words


def clean_by_len(tokenized_words, cut_off_length):
    # Equivalent to: [word for word in tokenized_words if len(word) > cut_off_length]
    cleaned_words = []

    for word in tokenized_words:
        # Remove words whose length is cut_off_length or shorter
        if len(word) > cut_off_length:
            cleaned_words.append(word)

    return cleaned_words


# Call the functions as required by the exercise
cleaned_by_freq = clean_by_freq(tokenized_words, 2)
cleaned_words = clean_by_len(cleaned_by_freq, 2)

cleaned_words

 

Stopwords

Stopwords are words that carry little meaning in the corpus or fall outside the goal of the analysis; they get in the way of accurate analysis and should be removed.

Defining Stopwords

Stopword removal proceeds as follows:

  • Prepare a stopword set.
  • Check whether each word token in the corpus is in the stopword set.
  • Exclude tokens that appear in the stopword set from the analysis.

Building a stopword set for every analysis is tedious, and some stopwords are common regardless of the kind of corpus, so it is much more convenient to load a ready-made list when needed. NLTK provides a default list of 179 stopwords, accessible via stopwords.words('english').

from nltk.corpus import stopwords
nltk.download('stopwords')

stopwords_set = set(stopwords.words('english'))

print('Number of stopwords:', len(stopwords_set))
print(stopwords_set)

 

The stopword set contains words that occur frequently in typical corpora but contribute little to analysis.

 

Depending on the situation, you may need to add new words to NLTK's default stopword list or remove some of its entries. For that, you can use the set methods add() and remove().

stopwords_set.add('hello')
stopwords_set.remove('the')
stopwords_set.remove('me')

print('Number of stopwords:', len(stopwords_set))
print('Stopwords:', stopwords_set)

 

hello has been added to the stopword set loaded from NLTK, and the and me have been removed.

 

You can also define and use your own stopword set in addition to the one NLTK provides.

my_stopwords_set = {'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves'}

print(my_stopwords_set)
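If needed, a custom set like this can also be merged with the NLTK list using an ordinary set union; this is plain Python set behavior rather than an NLTK feature.

# Combine NLTK's stopwords with a custom stopword set
combined_stopwords_set = stopwords_set | my_stopwords_set

print('Number of combined stopwords:', len(combined_stopwords_set))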

Example - Building a Stopword Removal Function


Write a function clean_by_stopwords() that removes stopwords.

- clean_by_stopwords() takes the word-tokenized corpus (tokenized_words) and a stopword set (stopwords_set) as parameters.
- It returns the list of word tokens with the stopwords removed.
- Use NLTK's default stopword list, loaded as a set, for the stopword set.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# from text import TEXT
nltk.download('stopwords')
nltk.download('punkt')

TEXT = """After reading the comments for this movie, I am not sure whether I should be angry, sad or sickened. Seeing comments typical of people who a)know absolutely nothing about the military or b)who base everything they think they know on movies like this or on CNN reports about Abu-Gharib makes me wonder about the state of intellectual stimulation in the world. At the time I type this the number of people in the US military: 1.4 million on Active Duty with another almost 900,000 in the Guard and Reserves for a total of roughly 2.3 million. The number of people indicted for abuses at at Abu-Gharib: Currently less than 20 That makes the total of people indicted .00083% of the total military. Even if you indict every single military member that ever stepped in to Abu-Gharib, you would not come close to making that a whole number.  The flaws in this movie would take YEARS to cover. I understand that it's supposed to be sarcastic, but in reality, the writer and director are trying to make commentary about the state of the military without an enemy to fight. In reality, the US military has been at its busiest when there are not conflicts going on. The military is the first called for disaster relief and humanitarian aid missions. When the tsunami hit Indonesia, devestating the region, the US military was the first on the scene. When the chaos of the situation overwhelmed the local governments, it was military leadership who looked at their people, the same people this movie mocks, and said make it happen. Within hours, food aid was reaching isolated villages. Within days, airfields were built, cargo aircraft started landing and a food distribution system was up and running. Hours and days, not weeks and months. Yes there are unscrupulous people in the US military. But then, there are in every walk of life, every occupation. But to see people on this website decide that 2.3 million men and women are all criminal, with nothing on their minds but thoughts of destruction or mayhem is an absolute disservice to the things that they do every day. One person on this website even went so far as to say that military members are in it for personal gain. Wow! Entry level personnel make just under $8.00 an hour assuming a 40 hour work week. Of course, many work much more than 40 hours a week and those in harm's way typically put in 16-18 hour days for months on end. That makes the pay well under minimum wage. So much for personal gain. I beg you, please make yourself familiar with the world around you. Go to a nearby base, get a visitor pass and meet some of the men and women you are so quick to disparage. You would be surprised. The military no longer accepts people in lieu of prison time. They require a minimum of a GED and prefer a high school diploma. The middle ranks are expected to get a minimum of undergraduate degrees and the upper ranks are encouraged to get advanced degrees.
"""

corpus = TEXT
tokenized_words = word_tokenize(corpus)

# Load NLTK's stopword list as a set
stopwords_set = set(stopwords.words('english'))

def clean_by_stopwords(tokenized_words, stopwords_set):
    cleaned_words = []

    for word in tokenized_words:
        if word not in stopwords_set:
            cleaned_words.append(word)

    return cleaned_words

# Test code
clean_by_stopwords(tokenized_words, stopwords_set)

Normalization

The more words there are that differ in form but share the same meaning, like US, USA, U.S., and America, the more complex the corpus becomes and the harder it is to analyze. It is therefore better to unify words with the same meaning into a single form, which is called normalization.

There are several ways to normalize, but the most commonly used are case folding and rule-based normalization.

Case Folding

Most programming languages distinguish upper and lower case, so converting the corpus to all uppercase or all lowercase is itself a form of normalization. In English, capital letters appear only in special situations and lowercase dominates, so converting everything to lowercase is the usual choice.

text = "What can I do for you? Do your homework now."

# Convert to lowercase
print(text.lower())

 

do and Do were originally different tokens, but after lowercasing they have exactly the same form.

 

Rule-Based Normalization

USA, US, and U.S. differ in form but mean the same thing. Likewise, Umm and Ummmmm, though not standard spellings, share a meaning and can be normalized too. Such words can be merged into a single representation by defining explicit rules.

# Synonym dictionary
synonym_dict = {'US': 'USA', 'U.S': 'USA', 'Ummm': 'Umm', 'Ummmm': 'Umm'}
text = "She became a US citizen. Ummmm, I think, maybe and or."
normalized_words = []

# Word tokenization
tokenized_words = nltk.word_tokenize(text)

for word in tokenized_words:
    # If a word is in the synonym dictionary, replace it with its canonical form (the value)
    if word in synonym_dict.keys():
        word = synonym_dict[word]

    normalized_words.append(word)

# Check the result
print(normalized_words)

 

US has been changed to USA, and Ummmm to Umm, as intended.

 

Stemming

The core part of a word is called its stem, and finding the stem of a word is called stemming. Because different surface forms collapse to the same stem, stemming is used as one of the normalization techniques.

Below are some of the rules of the Porter stemmer algorithm, one of the stemming algorithms. Because it simply strips suffixes to find the stem, it sometimes produces words that are not in the dictionary.

  • alize → al (Formalize → Formal)
  • ational → ate (Relational → Relate)
  • ate → removed (Activate → Activ)
  • ment → removed (Encouragement → Encourage)

In the example above, the stem activ obtained by stripping ate from Activate is not a dictionary word. A proper form would require appending an e after removing ate, but the algorithm is not that delicate. You therefore have to judge, based on the corpus and the analysis at hand, whether stemming is appropriate; otherwise words that matter for the analysis can be lost.

 

NLTK provides two stemming algorithms, the Porter stemmer and the Lancaster stemmer. Their stemming rules differ slightly, so the results vary a little depending on which one you use, but the usage is almost identical, so let's look at the Porter stemmer in detail.

# Porter Stemmer algorithm
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()
text = "You are so lovely. I am loving you now."
porter_stemmed_words = []

# Word tokenization
tokenized_words = nltk.word_tokenize(text)

# Stem each token with the Porter stemmer
for word in tokenized_words:
    # porter_stemmer.stem(): returns the extracted stem when a Porter rule applies, otherwise returns the word unchanged
    stem = porter_stemmer.stem(word)
    porter_stemmed_words.append(stem)

# Check the result
print('Before stemming:', tokenized_words)
print('After Porter stemming:', porter_stemmed_words)

 

lovely and loving were both stemmed to love: two different words have been normalized to a single stem.
For the Lancaster stemmer, you only need to swap the stemmer used in the same code.
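As a quick sketch of that point, swapping in NLTK's LancasterStemmer only changes the class that is instantiated; the sample sentence is reused from the Porter example above.

# Lancaster Stemmer algorithm: same loop, different stemmer class
from nltk.stem import LancasterStemmer

lancaster_stemmer = LancasterStemmer()
text = "You are so lovely. I am loving you now."
lancaster_stemmed_words = []

# Word tokenization
tokenized_words = nltk.word_tokenize(text)

for word in tokenized_words:
    lancaster_stemmed_words.append(lancaster_stemmer.stem(word))

# Check the result
print('Before stemming:', tokenized_words)
print('After Lancaster stemming:', lancaster_stemmed_words)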

Example - Porter Stemmer Algorithm


Write a function stemming_by_porter() that extracts stems with the Porter stemmer algorithm.

  • stemming_by_porter() takes the tokenized corpus (tokenized_words) as a parameter.
  • It returns the list of stemmed tokens.
# Import the required packages and functions
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
# from text import TEXT
nltk.download('punkt')

TEXT = """After reading the comments for this movie, I am not sure whether I should be angry, sad or sickened. Seeing comments typical of people who a)know absolutely nothing about the military or b)who base everything they think they know on movies like this or on CNN reports about Abu-Gharib makes me wonder about the state of intellectual stimulation in the world. At the time I type this the number of people in the US military: 1.4 million on Active Duty with another almost 900,000 in the Guard and Reserves for a total of roughly 2.3 million. The number of people indicted for abuses at at Abu-Gharib: Currently less than 20 That makes the total of people indicted .00083% of the total military. Even if you indict every single military member that ever stepped in to Abu-Gharib, you would not come close to making that a whole number.  The flaws in this movie would take YEARS to cover. I understand that it's supposed to be sarcastic, but in reality, the writer and director are trying to make commentary about the state of the military without an enemy to fight. In reality, the US military has been at its busiest when there are not conflicts going on. The military is the first called for disaster relief and humanitarian aid missions. When the tsunami hit Indonesia, devestating the region, the US military was the first on the scene. When the chaos of the situation overwhelmed the local governments, it was military leadership who looked at their people, the same people this movie mocks, and said make it happen. Within hours, food aid was reaching isolated villages. Within days, airfields were built, cargo aircraft started landing and a food distribution system was up and running. Hours and days, not weeks and months. Yes there are unscrupulous people in the US military. But then, there are in every walk of life, every occupation. But to see people on this website decide that 2.3 million men and women are all criminal, with nothing on their minds but thoughts of destruction or mayhem is an absolute disservice to the things that they do every day. One person on this website even went so far as to say that military members are in it for personal gain. Wow! Entry level personnel make just under $8.00 an hour assuming a 40 hour work week. Of course, many work much more than 40 hours a week and those in harm's way typically put in 16-18 hour days for months on end. That makes the pay well under minimum wage. So much for personal gain. I beg you, please make yourself familiar with the world around you. Go to a nearby base, get a visitor pass and meet some of the men and women you are so quick to disparage. You would be surprised. The military no longer accepts people in lieu of prison time. They require a minimum of a GED and prefer a high school diploma. The middle ranks are expected to get a minimum of undergraduate degrees and the upper ranks are encouraged to get advanced degrees.
"""

corpus = TEXT
tokenized_words = word_tokenize(corpus)

# Stemming with the Porter stemmer
def stemming_by_porter(tokenized_words):
    porter_stemmer = PorterStemmer()
    porter_stemmed_words = []

    for word in tokenized_words:
        stem = porter_stemmer.stem(word)   # Extract the stem
        porter_stemmed_words.append(stem)  # Collect the extracted stems

    return porter_stemmed_words

# Test code
stemming_by_porter(tokenized_words)