Python利用多進程將大量數據放入有限內存的教程-CDA數據分析師官網

熱線電話：13121318867

登錄

首頁精彩閱讀Python利用多進程將大量數據放入有限內存的教程

Python利用多進程將大量數據放入有限內存的教程

2018-01-31

收藏

Python利用多進程將大量數據放入有限內存的教程

這是一篇有關如何將大量的數據放入有限的內存中的簡略教程。

與客戶工作時，有時會發現他們的數據庫實際上只是一個csv或Excel文件倉庫，你只能將就著用，經常需要在不更新他們的數據倉庫的情況下完成工作。大部分情況下，如果將這些文件存儲在一個簡單的數據庫框架中或許更好，但時間可能不允許。這種方法對時間、機器硬件和所處環境都有要求。

下面介紹一個很好的例子：假設有一堆表格（沒有使用Neo4j、MongoDB或其他類型的數據庫，僅僅使用csvs、tsvs等格式存儲的表格），如果將所有表格組合在一起，得到的數據幀太大，無法放入內存。所以第一個想法是：將其拆分成不同的部分，逐個存儲。這個方案看起來不錯，但處理起來很慢。除非我們使用多核處理器。
目標

這里的目標是從所有職位中（大約1萬個），找出相關的的職位。將這些職位與政府給的職位代碼組合起來。接著將組合的結果與對應的州（行政單位）信息組合起來。然后用通過word2vec生成的屬性信息在我們的客戶的管道中增強已有的屬性。

這個任務要求在短時間內完成，誰也不愿意等待。想象一下，這就像在不使用標準的關系型數據庫的情況下進行多個表的連接。
數據

示例腳本

下面的是一個示例腳本，展示了如何使用multiprocessing來在有限的內存空間中加速操作過程。腳本的第一部分是和特定任務相關的，可以自由跳過。請著重關注第二部分，這里側重的是multiprocessing引擎。

#import the necessary packages
import pandas as pd
import us
import numpy as np
from multiprocessing import Pool,cpu_count,Queue,Manager

# the data in one particular column was number in the form that horrible excel version
# of a number where '12000' is '12,000' with that beautiful useless comma in there.
# did I mention I excel bothers me?
# instead of converting the number right away, we only convert them when we need to
def median_maker(column):
return np.median([int(x.replace(',','')) for x in column])

# dictionary_of_dataframes contains a dataframe with information for each title; e.g title is 'Data Scientist'
# related_title_score_df is the dataframe of information for the title; columns = ['title','score']
### where title is a similar_title and score is how closely the two are related, e.g. 'Data Analyst', 0.871
# code_title_df contains columns ['code','title']
# oes_data_df is a HUGE dataframe with all of the Bureau of Labor Statistics(BLS) data for a given time period (YAY FREE DATA, BOO BAD CENSUS DATA!)

def job_title_location_matcher(title,location):
try:
    related_title_score_df = dictionary_of_dataframes[title]
    # we limit dataframe1 to only those related_titles that are above
    # a previously established threshold
    related_title_score_df = related_title_score_df[title_score_df['score']>80]

    #we merge the related titles with another table and its codes
    codes_relTitles_scores = pd.merge(code_title_df,related_title_score_df)
    codes_relTitles_scores = codes_relTitles_scores.drop_duplicates()

    # merge the two dataframes by the codes
    merged_df = pd.merge(codes_relTitles_scores, oes_data_df)
    #limit the BLS data to the state we want
    all_merged = merged_df[merged_df['area_title']==str(us.states.lookup(location).name)]

    #calculate some summary statistics for the time we want
    group_med_emp,group_mean,group_pct10,group_pct25,group_median,group_pct75,group_pct90 = all_merged[['tot_emp','a_mean','a_pct10','a_pct25','a_median','a_pct75','a_pct90']].apply(median_maker)
    row = [title,location,group_med_emp,group_mean,group_pct10,group_pct25, group_median, group_pct75, group_pct90]
    #convert it all to strings so we can combine them all when writing to file
    row_string = [str(x) for x in row]
    return row_string
except:
    # if it doesnt work for a particular title/state just throw it out, there are enough to make this insignificant
    'do nothing'

這里發生了神奇的事情：
#runs the function and puts the answers in the queue
def worker(row, q):
    ans = job_title_location_matcher(row[0],row[1])
    q.put(ans)

# this writes to the file while there are still things that could be in the queue
# this allows for multiple processes to write to the same file without blocking eachother
def listener(q):
f = open(filename,'wb')
while 1:
    m = q.get()
    if m =='kill':
        break
    f.write(','.join(m) + 'n')
    f.flush()
f.close()

def main():
#load all your data, then throw out all unnecessary tables/columns
filename = 'skill_TEST_POOL.txt'

#sets up the necessary multiprocessing tasks
manager = Manager()
q = manager.Queue()
pool = Pool(cpu_count() + 2)
watcher = pool.map_async(listener,(q,))

jobs = []
#titles_states is a dataframe of millions of job titles and states they were found in
for i in titles_states.iloc:
    job = pool.map_async(worker, (i, q))
    jobs.append(job)

for job in jobs:
    job.get()
q.put('kill')
pool.close()
pool.join()

if __name__ == "__main__":
main()

由于每個數據幀的大小都不同（總共約有100Gb），所以將所有數據都放入內存是不可能的。通過將最終的數據幀逐行寫入內存，但從來不在內存中存儲完整的數據幀。我們可以完成所有的計算和組合任務。這里的“標準方法”是，我們可以僅僅在“job_title_location_matcher”的末尾編寫一個“write_line”方法，但這樣每次只會處理一個實例。根據我們需要處理的職位/州的數量，這大概需要2天的時間。而通過multiprocessing，只需2個小時。

雖然讀者可能接觸不到本教程處理的任務環境，但通過multiprocessing，可以突破許多計算機硬件的限制。本例的工作環境是c3.8xl ubuntu ec2，硬件為32核60Gb內存（雖然這個內存很大，但還是無法一次性放入所有數據）。這里的關鍵之處是我們在60Gb的內存的機器上有效的處理了約100Gb的數據，同時速度提升了約25倍。通過multiprocessing在多核機器上自動處理大規模的進程，可以有效提高機器的利用率。也許有些讀者已經知道了這個方法，但對于其他人，可以通過multiprocessing能帶來非常大的收益。

CDA數據分析師考試相關入口一覽（建議收藏）：

? 想報名CDA認證考試，點擊>>> “CDA報名” 了解CDA考試詳情；

? 想學習CDA考試教材，點擊>>> “CDA教材” 了解CDA考試詳情；

? 想加入CDA考試題庫，點擊>>> “CDA題庫” 了解CDA考試詳情；

? 想了解CDA考試含金量，點擊>>> “CDA含金量” 了解CDA考試詳情；

numpy pandas 關系型數據庫數據倉庫

數據分析咨詢請掃描二維碼

若不方便掃碼，搜微信號：CDAshujufenxi

上一篇當我用python爬了公司BD王同事的微信好友...

下一篇數據分析方法之分解分析介紹

數據分析師考試動態

考試介紹
考試大綱
考試內容
考試地點

CDA報考指南

報考流程
考試時間
報名費用
聯系我們

數據分析學習

數據分析師資訊

更多

Copyright © 2015-2021, www.ruiqisteel.com All Rights Reserved. CDA數據分析師(北京國富如荷網絡科技有限公司) 版權所有京ICP備11001960號-9

京公網安備 11010802034615號經營許可證編號：京B2-20210330

聯系電話：13321103290 (微信同號)

OK

免費資料
免費試聽
訂制課程
職業規劃
認證考試

客服在線

日韩人妻系列无码专区视频,先锋高清无码,无码免费视欧非,国精产品一区一区三区无码

客服在線

立即咨詢

免密碼登錄

提交首次登錄驗證后自動注冊