
How to Run Word Segmentation, Word-Frequency Counting, Word-Cloud Visualization, and Sentiment Analysis on Scraped Text

2022-02-09


Author: Python進階者

Source: Python爬蟲與數據挖掘

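Foreword

A few days ago, a reader who goes by 【小明】 asked a question about text visualization and semantic analysis in Python.

He wanted to build a corpus. The data he had scraped with a Python web crawler was stored in a csv file, and he did not know how to move it into a txt file. He was also unsure about the later steps: word-cloud visualization, word segmentation, and semantic analysis.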

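I. Approach

There is a fair amount to cover. The overall plan is: pull the text out of the csv file, segment it and filter out stop words, draw a word cloud, and then run sentiment analysis.

1. Extract the text from the csv file row by row and save it to a new txt file. Running 《讀取csv文件中文本并存txt文檔.py》 produces the file 《職位表述文本.txt》.

2. Run 《使用停用詞獲取最后的文本內容.py》 to segment the text and remove stop words. This produces the file 《職位表述文本分詞后_outputs.txt》.

3. Run 《指定txt詞云圖.py》 to generate the word cloud.

4. Run 《jieba分詞并統計詞頻后輸出結果到Excel和txt文檔.py》 to produce 《wordCount_all_lyrics.xls》 and 《分詞結果.txt》. Remove the counts from 《分詞結果.txt》 to create 《情感分析用詞.txt》, which is the input for the sentiment analysis in step 5.

5. Run 《情感分析.py》 to get a sentiment score for each word. The average of these scores gives a rough indication of whether the overall sentiment is positive or negative.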

II. Implementation

1. Extract the text from the csv file row by row and save it to a new txt file

Run 《讀取csv文件中文本并存txt文檔.py》 to produce the file 《職位表述文本.txt》. The code is as follows.

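# coding: utf-8
import pandas as pd

df = pd.read_csv('./職位描述.csv', encoding='gbk')
# print(df.head())

# Open the output file once instead of reopening it for every row.
with open('職位表述文本.txt', mode='w', encoding='utf-8') as file:
    for text in df['Job_Description']:
        # print(text)
        # pandas stores empty cells as NaN, which an `is not None` check lets through.
        if pd.notna(text):
            # One record per line, so the next step can read the file line by line.
            file.write(str(text) + '\n')

print('Done writing')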
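If pandas is not available, the same extraction can be sketched with the standard library's `csv` module. This is a minimal sketch, not the author's script: the helper name `csv_column_to_txt` and its parameters are made up for illustration, while the column name `Job_Description` and the gbk encoding follow the code above.

```python
import csv


def csv_column_to_txt(csv_path, txt_path, column, csv_encoding="gbk"):
    """Copy one CSV column into a text file, one non-empty cell per line.

    Returns the number of lines written.
    """
    written = 0
    # newline="" lets the csv module handle line breaks inside quoted cells
    with open(csv_path, newline="", encoding=csv_encoding) as src, \
            open(txt_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            text = (row.get(column) or "").strip()
            if text:  # skip empty cells
                # Collapse internal whitespace so each record stays on one line
                dst.write(" ".join(text.split()) + "\n")
                written += 1
    return written
```

A call such as `csv_column_to_txt('./職位描述.csv', '職位表述文本.txt', 'Job_Description')` would then mirror the pandas version.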

2. Segment the text and remove stop words

Run 《使用停用詞獲取最后的文本內容.py》 to segment the text and filter out stop words. This produces the file 《職位表述文本分詞后_outputs.txt》. The code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import jieba

# jieba.load_userdict('userdict.txt')


# Build the stop-word collection (a set gives fast membership tests)
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


# Load the stop words once rather than once per sentence
stopwords = stopwordslist('stop_word.txt')  # path to the stop-word file


# Segment one sentence and drop stop words
def seg_sentence(sentence):
    sentence_seged = jieba.cut(sentence.strip())
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords and word != '\t':
            outstr += word
            outstr += ' '
    return outstr


with open('職位表述文本.txt', 'r', encoding='utf-8') as inputs, \
        open('職位表述文本分詞后_outputs.txt', 'w', encoding='utf-8') as outputs:
    for line in inputs:
        line_seg = seg_sentence(line)  # the return value is a string
        outputs.write(line_seg + '\n')

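The key steps all carry comments, so you only need to swap in your own txt files. If you run into an encoding error, changing utf-8 to gbk usually resolves it.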
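Rather than editing the encoding by hand, the fallback can be written into the code. The sketch below tries utf-8 first and then gbk; the function name `read_text_with_fallback` is hypothetical and not part of the original scripts. utf-8 goes first because it is the stricter of the two: gbk will decode many byte sequences without complaint even when the result is wrong.

```python
def read_text_with_fallback(path, encodings=("utf-8", "gbk")):
    """Return the file's text, decoded with the first encoding that works."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"could not decode {path!r} with any of {encodings}")
```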

3. Draw the word cloud

Run 《指定txt詞云圖.py》 to generate the word cloud. The code is as follows:

from wordcloud import WordCloud
import jieba
import numpy
import PIL.Image as Image


def cut(text):
    # Segment with jieba and join the tokens with spaces for WordCloud
    wordlist_jieba = jieba.cut(text)
    space_wordlist = " ".join(wordlist_jieba)
    return space_wordlist


with open(r"C:\Users\pdcfi\Desktop\xiaoming\職位表述文本.txt", encoding="utf-8") as file:
    text = file.read()

text = cut(text)
# The mask image determines the shape of the cloud
mask_pic = numpy.array(Image.open(r"C:\Users\pdcfi\Desktop\xiaoming\python.png"))
wordcloud = WordCloud(font_path=r"C:/Windows/Fonts/simfang.ttf",  # a font with Chinese glyphs
                      collocations=False,
                      max_words=100,
                      min_font_size=10,
                      max_font_size=500,
                      mask=mask_pic).generate(text)
image = wordcloud.to_image()
# image.show()
wordcloud.to_file('詞云圖.png')  # save the word cloud to disk

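To use your own picture, just replace the original mask image. This demo uses a Python logo as the mask.

[Figure: word cloud rendered in the shape of the Python logo]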
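The script above segments the raw text file, so stop words can still show up in the cloud. One possible variation is to count the tokens of the stop-word-filtered output from step 2 and pass the counts to `WordCloud.generate_from_frequencies`. The helper names `count_tokens` and `render_cloud` are illustrative assumptions, not part of the original code.

```python
from collections import Counter


def count_tokens(segmented_text, stopwords=frozenset()):
    """Count space-separated tokens, ignoring stop words."""
    tokens = segmented_text.split()
    return Counter(t for t in tokens if t not in stopwords)


def render_cloud(frequencies, font_path, out_path):
    """Draw a word cloud from a {word: count} mapping and save it as an image."""
    # Imported here so the counting helper above works without wordcloud installed
    from wordcloud import WordCloud

    cloud = WordCloud(font_path=font_path, max_words=100)
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)
```

Working from explicit counts also makes it easy to inspect the top words with `most_common()` before drawing anything.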

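4. Word-frequency statistics

Run 《jieba分詞并統計詞頻后輸出結果到Excel和txt文檔.py》 to produce 《wordCount_all_lyrics.xls》 and 《分詞結果.txt》. Remove the counts from 《分詞結果.txt》 to create 《情感分析用詞.txt》, which is the input for the sentiment analysis in step 5. The code is as follows: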

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
from collections import Counter

import jieba
import jieba.analyse
import xlwt  # library for writing .xls workbooks

if __name__ == "__main__":
    wbk = xlwt.Workbook(encoding='utf-8')
    sheet = wbk.add_sheet("wordCount")  # worksheet name
    word_lst = []

    # Source document to analyse
    with open('職位表述文本.txt', encoding='utf-8') as src:
        for line in src:
            item = line.strip('\n\r').split('\t')  # split on tabs
            # print(item)
            # extract_tags returns the top TF-IDF keywords of the text (20 by default)
            tags = jieba.analyse.extract_tags(item[0])
            word_lst.extend(tags)

    # Count occurrences and sort by frequency, highest first
    ordered = Counter(word_lst).most_common()

    with open("分詞結果.txt", 'w', encoding='utf-8') as wf2:  # name of the txt output
        for key, count in ordered:
            wf2.write(key + ' ' + str(count) + '\n')  # one "word count" pair per line

    for i, (key, count) in enumerate(ordered):
        sheet.write(i, 0, label=key)
        sheet.write(i, 1, label=count)
    wbk.save('wordCount_all_lyrics.xls')  # save the workbook

The resulting txt and Excel files look like this:

[Figure: screenshots of the generated txt and Excel files]
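The text mentions removing the counts from 《分詞結果.txt》 to create 《情感分析用詞.txt》 but gives no code for it. Since the script above writes each line as a word, a space, and a count, that step could be sketched as follows. The function names are hypothetical, and the sketch assumes both files are utf-8 encoded.

```python
def strip_counts(lines):
    """Drop the trailing count from 'word count' lines, keeping only the word."""
    words = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last space only, in case a keyword itself contains one
        word, _, count = line.rpartition(" ")
        # Lines without a numeric count are kept unchanged
        words.append(word if word and count.isdigit() else line)
    return words


def make_sentiment_input(src="分詞結果.txt", dst="情感分析用詞.txt"):
    """Write the words from src, without their counts, one per line to dst."""
    with open(src, encoding="utf-8") as f:
        words = strip_counts(f)
    with open(dst, "w", encoding="utf-8") as f:
        f.write("\n".join(words) + "\n")
```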

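5. Sentiment scores

Run 《情感分析.py》 to get a sentiment score for each word. The average of these scores gives a rough indication of whether the overall sentiment is positive or negative. The code is as follows: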

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from snownlp import SnowNLP

# Positive / negative
# print(s.sentiments)  # 0.9769551298267365, the probability of being positive


def get_word():
    # Read every non-empty line of the word file
    with open("情感分析用詞.txt", encoding='utf-8') as f:
        return [line.strip('\r\n') for line in f if line.strip()]


def get_sentiment(word):
    text = u'{}'.format(word)
    s = SnowNLP(text)
    print(s.sentiments)


if __name__ == '__main__':
    words = get_word()
    for word in words:
        get_sentiment(word)

    # text = u'''
    # 也許
    # '''
    # s = SnowNLP(text)
    # print(s.sentiments)
    # with open('lyric_sentiments.txt', 'a', encoding='utf-8') as fp:
    #     fp.write(str(s.sentiments)+'\n')
    # print('happy end')

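After the program runs, the NLP-based analysis yields the following sentiment scores:

[Figure: sentiment scores produced by the script]

Take the average of these scores. An average above 0.5 generally indicates positive sentiment, and for this data set the overall sentiment turned out to be positive.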
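The averaging step is described but not coded. A minimal sketch of it follows, assuming the scores have been collected into a list rather than printed. The function name `summarize_sentiment` is an illustrative assumption; the 0.5 threshold is the rule of thumb stated above.

```python
from statistics import mean


def summarize_sentiment(scores, threshold=0.5):
    """Return (average score, label) for a list of sentiment scores in [0, 1]."""
    if not scores:
        raise ValueError("no scores to summarize")
    avg = mean(scores)
    # Above the threshold counts as positive, anything else as negative
    label = "positive" if avg > threshold else "negative"
    return avg, label
```

To use it with the script above, have `get_sentiment` return `s.sentiments` instead of printing it, and pass the collected values to `summarize_sentiment`.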

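III. Summary

I am Python進階者. Prompted by a reader's question, this article walked step by step through word segmentation, word-frequency counting, word-cloud visualization, and sentiment analysis on scraped text, which adds up to a small but complete project. The next time you meet a similar problem or a small course assignment, try this project for practice. It may well come in handy.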
