介紹如何在 Python 中使用 sas7bdat
模組讀取 SAS7BDAT 格式的 SAS 檔案。
sas7bdat
模組Python 的 sas7bdat
模組可以透過 pip
指令來安裝:
# 安裝 sas7bdat 模組
pip install sas7bdat
這裡我們以 airline.sas7bdat 這個檔案來示範如何使用 Python 的 sas7bdat
模組來讀取 SAS7BDAT 格式的 SAS 檔案。
首先以 sas7bdat
模組的 SAS7BDAT
函數開啟 SAS7BDAT 檔案:
# 引入 sas7bdat 模組的 SAS7BDAT 函數 from sas7bdat import SAS7BDAT # 開啟 SAS7BDAT 格式的 SAS 檔案 reader = SAS7BDAT('airline.sas7bdat')
reader.header
中包含了詳細的標頭資訊:
# 標頭資訊 print(reader.header)
Header: col_count_p1: 6 col_count_p2: 0 column_count: 6 compression: None creator: Written by SAS creator_proc: None date_created: 2008-05-13 15:25:11 date_modified: 2008-05-13 15:25:11 endianess: little file_type: DATA filename: airline.sas7bdat header_length: 1024 lcp: 0 lcs: 16 mix_page_row_count: 32 name: AIRLINE os_name: os_type: page_count: 1 page_length: 4096 platform: windows row_count: 32 row_length: 44 sas_release: 9.0000M0 server_type: WIN u64: False Contents of dataset "AIRLINE": Num Name Type Length Format Label --- ------- ------ ------ ------ ------------------ 1 YEAR number 4 year 2 Y number 8 level of output 3 W number 8 wage rate 4 R number 8 interest rate 5 L number 8 labor input 6 K number 8 capital input
透過 reader.columns
可以取得 SAS 變數的屬性(variable attributes):
# 取得變數屬性(Variable Attributes) for c in reader.columns: print("col_id:", c.col_id) print(" name:", c.name.decode(encoding = "utf-8")) print(" label:", c.label.decode(encoding = "utf-8")) print(" format:", c.format) print(" type:", c.type) print(" length:", c.length)
col_id: 0 name: YEAR label: year format: type: number length: 4 col_id: 1 name: Y label: level of output format: type: number length: 8 col_id: 2 name: W label: wage rate format: type: number length: 8 col_id: 3 name: R label: interest rate format: type: number length: 8 col_id: 4 name: L label: labor input format: type: number length: 8 col_id: 5 name: K label: capital input format: type: number length: 8
實際的資料內容則可透過 reader
取得:
# 讀取資料 for r in reader: print(r)
['YEAR', 'Y', 'W', 'R', 'L', 'K'] [1948.0, 1.2139999866485596, 0.24300000071525574, 0.1454000025987625, 1.4149999618530273, 0.6119999885559082] [1949.0, 1.3539999723434448, 0.25999999046325684, 0.21809999644756317, 1.3839999437332153, 0.5590000152587891] [1950.0, 1.569000005722046, 0.27799999713897705, 0.3156999945640564, 1.3880000114440918, 0.5730000138282776] [1951.0, 1.9479999542236328, 0.296999990940094, 0.39399999380111694, 1.5499999523162842, 0.5640000104904175] [1952.0, 2.265000104904175, 0.3100000023841858, 0.35589998960494995, 1.8020000457763672, 0.5740000009536743] [1953.0, 2.7309999465942383, 0.32199999690055847, 0.35929998755455017, 1.9259999990463257, 0.7110000252723694] [1954.0, 3.0250000953674316, 0.33500000834465027, 0.4025000035762787, 1.9639999866485596, 0.7760000228881836] [1955.0, 3.562000036239624, 0.3499999940395355, 0.3961000144481659, 2.115999937057495, 0.8270000219345093] [1956.0, 3.9790000915527344, 0.3610000014305115, 0.3822000026702881, 2.434999942779541, 0.800000011920929] [1957.0, 4.420000076293945, 0.3790000081062317, 0.304500013589859, 2.7070000171661377, 0.9210000038146973] [1958.0, 4.563000202178955, 0.39100000262260437, 0.32839998602867126, 2.7060000896453857, 1.0670000314712524] [1959.0, 5.385000228881836, 0.4259999990463257, 0.3856000006198883, 2.8459999561309814, 1.0829999446868896] [1960.0, 5.553999900817871, 0.4410000145435333, 0.31929999589920044, 3.0889999866485596, 1.4809999465942383] [1961.0, 5.465000152587891, 0.46000000834465027, 0.30790001153945923, 3.121999979019165, 1.7359999418258667] [1962.0, 5.824999809265137, 0.48500001430511475, 0.3783000111579895, 3.184000015258789, 1.9259999990463257] [1963.0, 6.875999927520752, 0.5059999823570251, 0.4180000126361847, 3.263000011444092, 2.0409998893737793] [1964.0, 7.822999954223633, 0.5379999876022339, 0.5163000226020813, 3.4119999408721924, 1.996999979019165] [1965.0, 9.119999885559082, 0.5640000104904175, 0.5878999829292297, 3.622999906539917, 2.256999969482422] [1966.0, 10.51200008392334, 0.5860000252723694, 0.536899983882904, 4.073999881744385, 2.742000102996826] [1967.0, 13.020000457763672, 0.621999979019165, 0.44429999589920044, 4.710000038146973, 3.563999891281128] [1968.0, 15.26099967956543, 0.6660000085830688, 0.3052000105381012, 5.2170000076293945, 4.767000198364258] [1969.0, 16.312999725341797, 0.7310000061988831, 0.23319999873638153, 5.568999767303467, 6.511000156402588] [1970.0, 16.00200080871582, 0.8309999704360962, 0.1882999986410141, 5.494999885559082, 7.626999855041504] [1971.0, 15.87600040435791, 0.906000018119812, 0.20229999721050262, 5.334000110626221, 8.67300033569336] [1972.0, 16.66200065612793, 1.0, 0.25060001015663147, 5.34499979019165, 8.331000328063965] [1973.0, 17.013999938964844, 1.055999994277954, 0.2667999863624573, 5.6620001792907715, 8.557000160217285] [1974.0, 19.30500030517578, 1.13100004196167, 0.266400009393692, 5.729000091552734, 9.508000373840332] [1975.0, 18.72100067138672, 1.246999979019165, 0.23010000586509705, 5.7220001220703125, 9.062000274658203] [1976.0, 19.25, 1.375, 0.3452000021934509, 5.76200008392334, 8.26200008392334] [1977.0, 20.64699935913086, 1.5440000295639038, 0.45080000162124634, 5.876999855041504, 7.473999977111816] [1978.0, 22.72599983215332, 1.7029999494552612, 0.5877000093460083, 6.107999801635742, 7.104000091552734] [1979.0, 23.618999481201172, 1.7790000438690186, 0.534600019454956, 6.8520002365112305, 6.874000072479248]
若習慣以 Pandas 來處理資料的話,亦可直接將 SAS7BDAT 檔案的內容轉為 Pandas 的 DataFrame 資料格式:
# 將資料直接轉換為 Pandas DataFrame
df = reader.to_data_frame()
df
YEAR Y W R L K 0 1948.0 1.214000 0.243 0.1454 1.415 0.612 1 1949.0 1.354000 0.260 0.2181 1.384 0.559 2 1950.0 1.569000 0.278 0.3157 1.388 0.573 3 1951.0 1.948000 0.297 0.3940 1.550 0.564 4 1952.0 2.265000 0.310 0.3559 1.802 0.574 5 1953.0 2.731000 0.322 0.3593 1.926 0.711 6 1954.0 3.025000 0.335 0.4025 1.964 0.776 7 1955.0 3.562000 0.350 0.3961 2.116 0.827 8 1956.0 3.979000 0.361 0.3822 2.435 0.800 9 1957.0 4.420000 0.379 0.3045 2.707 0.921 10 1958.0 4.563000 0.391 0.3284 2.706 1.067 11 1959.0 5.385000 0.426 0.3856 2.846 1.083 12 1960.0 5.554000 0.441 0.3193 3.089 1.481 13 1961.0 5.465000 0.460 0.3079 3.122 1.736 14 1962.0 5.825000 0.485 0.3783 3.184 1.926 15 1963.0 6.876000 0.506 0.4180 3.263 2.041 16 1964.0 7.823000 0.538 0.5163 3.412 1.997 17 1965.0 9.120000 0.564 0.5879 3.623 2.257 18 1966.0 10.512000 0.586 0.5369 4.074 2.742 19 1967.0 13.020000 0.622 0.4443 4.710 3.564 20 1968.0 15.261000 0.666 0.3052 5.217 4.767 21 1969.0 16.313000 0.731 0.2332 5.569 6.511 22 1970.0 16.002001 0.831 0.1883 5.495 7.627 23 1971.0 15.876000 0.906 0.2023 5.334 8.673 24 1972.0 16.662001 1.000 0.2506 5.345 8.331 25 1973.0 17.014000 1.056 0.2668 5.662 8.557 26 1974.0 19.305000 1.131 0.2664 5.729 9.508 27 1975.0 18.721001 1.247 0.2301 5.722 9.062 28 1976.0 19.250000 1.375 0.3452 5.762 8.262 29 1977.0 20.646999 1.544 0.4508 5.877 7.474 30 1978.0 22.726000 1.703 0.5877 6.108 7.104 31 1979.0 23.618999 1.779 0.5346 6.852 6.874
若要將資料匯出為逗點分隔(CSV)檔案,可以使用 convert_file
函數,而其 delimiter
可以指定欄位分隔字元(預設為逗號):
# 將資料匯出為 CSV 檔案 reader.convert_file("output.csv", delimiter=',')
檔案使用完畢後,呼叫 close
關閉檔案:
# 關閉檔案
reader.close()
sas7bdat_to_csv
指令稿工具在安裝 Python 的 sas7bdat
模組時,會附帶安裝一個 sas7bdat_to_csv
指令稿工具,它可以用來將 SAS7BDAT 格式的 SAS 檔案轉換為逗點分隔(CSV)檔案:
# 將 SAS7BDAT 檔案轉換為 CSV 檔案
sas7bdat_to_csv airline.sas7bdat
[airline.sas7bdat] ⟶ [airline.csv] wrote 32 of 32 lines Successfully converted 1 of 1 file
新產生的 CSV 檔案會儲存在 airline.csv
。