Sparta Data Analysis, Week 2


ssdad 2022. 6. 7. 09:14

import matplotlib as mpl

import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina'

!apt -qq -y install fonts-nanum

import matplotlib.font_manager as fm

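fontpath = '/usr/share/fonts/truetype/nanum/NanumBarunGothic.ttf'

font = fm.FontProperties(fname=fontpath, size=9)

# Register the font file directly; mpl.font_manager._rebuild() is a private function that newer matplotlib releases no longer provide

fm.fontManager.addfont(fontpath)

plt.rc('font', family='NanumBarunGothic')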

import pandas as pd

import numpy as np

# Use pandas to download the Naver shopping review data from the URL below.

df = pd.read_table('https://raw.githubusercontent.com/bab2min/corpus/master/sentiment/naver_shopping.txt', names=['ratings', 'reviews'])

# A rating above 3 is a positive review; a rating of 3 or below is a negative review, so

# label ratings above 3 as 1 and ratings of 3 or below as 0

df['label'] = np.select([df.ratings > 3], [1], default=0)
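To make the labeling rule concrete, here is a small sketch on a hand-made frame. The ratings and review strings are invented for illustration and are not taken from the dataset. It also shows that, for a single condition like this one, a boolean comparison cast to int should give the same labels as np.select.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset standing in for the real reviews
toy = pd.DataFrame({
    'ratings': [5, 4, 3, 2, 1],
    'reviews': ['great', 'good', 'so-so', 'bad', 'terrible'],
})

# Same rule as above: rating > 3 -> 1 (positive), otherwise 0 (negative)
toy['label'] = np.select([toy.ratings > 3], [1], default=0)

# Equivalent shortcut when there is only one condition
toy['label_alt'] = (toy.ratings > 3).astype(int)

print(toy[['ratings', 'label', 'label_alt']])
```

np.select becomes more useful once there are several conditions (for example a separate neutral class); for a two-way split either form works.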

!pip install konlpy # install the konlpy package

from konlpy.tag import Okt # import the Okt module

tokenizer = Okt() # use the Okt module under the name tokenizer

df['tokenized'] = df['reviews'].apply(tokenizer.nouns)

positive_reviews = np.hstack(df[df['label']==1]['tokenized'].values)

negative_reviews = np.hstack(df[df['label']==0]['tokenized'].values)
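Before drawing anything, it can help to look at the most frequent nouns directly. The sketch below uses a tiny invented list of token lists in place of the 'tokenized' column (the words are hypothetical) and counts them with the standard library. Flattening with itertools.chain plays the role that np.hstack plays above.

```python
from collections import Counter
from itertools import chain

# Hypothetical tokenized reviews (each inner list is one review's nouns)
tokenized = [
    ['delivery', 'price'],
    ['delivery', 'quality'],
    ['price', 'delivery'],
]

# Flatten the list of lists into one sequence of tokens
all_tokens = list(chain.from_iterable(tokenized))

# Count occurrences and list the most frequent nouns
counts = Counter(all_tokens)
print(counts.most_common(2))  # [('delivery', 3), ('price', 2)]
```

The same Counter call could be applied to positive_reviews or negative_reviews to check which words will dominate each word cloud.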

from wordcloud import WordCloud

import matplotlib.pyplot as plt # the same package imported earlier for the Korean font setup

fontpath = '/usr/share/fonts/truetype/nanum/NanumBarunGothic.ttf'

positive_data = ' '.join(positive_reviews) # join the list back into a single string

wc = WordCloud(max_words = 10000 , width = 1600 , height = 800, font_path = fontpath).generate(positive_data)

plt.figure(figsize = (15,15))

plt.imshow(wc, interpolation = 'bilinear')

plt.show()

negative_data = ' '.join(negative_reviews) # join the list back into a single string

wc2 = WordCloud(max_words = 10000 , width = 1600 , height = 800, font_path = fontpath).generate(negative_data)

plt.figure(figsize = (15,15)) # new figure so the negative cloud does not overwrite the positive one

plt.imshow(wc2, interpolation = 'bilinear')

plt.show()

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df['reviews'], df['label'], test_size = 0.3)

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfTransformer

# Build the document-term matrix (DTM),

dtmvector = CountVectorizer()

x_train_dtm = dtmvector.fit_transform(x_train)

# then use the DTM to create the TF-IDF vectors

tfidf_transformer = TfidfTransformer()

tfidfv = tfidf_transformer.fit_transform(x_train_dtm)
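To see what the TF-IDF step computes, the sketch below reproduces the weighting on a tiny hand-made count matrix with NumPy. It assumes scikit-learn's default settings for TfidfTransformer: smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The matrix values are invented.

```python
import numpy as np

# Hypothetical document-term matrix: 3 documents x 3 terms (raw counts)
dtm = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
], dtype=float)

n_docs = dtm.shape[0]

# Document frequency: in how many documents each term appears
doc_freq = (dtm > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit length (L2 norm)
weighted = dtm * idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(idf, 3))
print(np.round(tfidf, 3))
```

The first term appears in every document, so it gets the lowest IDF; the rarer terms are weighted up, which is the point of using TF-IDF instead of raw counts.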

# Create the model and train it

# Example of training a model with logistic regression

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=10000, penalty='l2')

lr.fit(tfidfv, y_train)

from sklearn.metrics import accuracy_score # accuracy calculation

# Check accuracy on the test data

x_test_dtm = dtmvector.transform(x_test) # convert the test data to a DTM

tfidfv_test = tfidf_transformer.transform(x_test_dtm) # convert the DTM to a TF-IDF matrix

predicted = lr.predict(tfidfv_test) # predict labels for the test data

print("Accuracy:", accuracy_score(y_test, predicted)) # compare predictions with the true labels
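The vectorize, transform, and classify steps above can also be chained into one estimator, so that new text always goes through the same fitted transformations. This is a minimal sketch using scikit-learn's make_pipeline, with a tiny invented English corpus standing in for the reviews. The texts and labels are hypothetical, and the classifier uses default settings rather than the C value above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative
texts = [
    'fast delivery great quality',
    'great price love it',
    'broken item bad quality',
    'late delivery bad service',
]
labels = [1, 1, 0, 0]

# Chain DTM -> TF-IDF -> logistic regression into one estimator
model = make_pipeline(CountVectorizer(), TfidfTransformer(), LogisticRegression())
model.fit(texts, labels)

# New text is transformed with the already-fitted vectorizer and transformer
predicted_toy = model.predict(['great quality', 'bad service'])
print(predicted_toy)
```

With a pipeline there is no need to call transform on the test data by hand, which removes the risk of accidentally refitting the vectorizer on test data.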

License: Attribution, NonCommercial, NoDerivatives
