The reshape Function

Created: November-22, 2018

The most flexible base R function for reshaping data is reshape. See ?reshape for its syntax.

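# create unbalanced longitudinal (panel) data set
# note: the output shown below matches the sampler used before R 3.6.0;
# on newer versions, set.seed(1234, sample.kind = "Rounding") should reproduce it
set.seed(1234)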
df <- data.frame(identifier=rep(1:5, each=3),
                 location=rep(c("up", "down", "left", "up", "center"), each=3),
                 period=rep(1:3, 5), counts=sample(35, 15, replace=TRUE),
                 values=runif(15, 5, 10))[-c(4,8,11),]
df

   identifier location period counts   values
1           1       up      1      4 9.186478
2           1       up      2     22 6.431116
3           1       up      3     22 6.334104
5           2     down      2     31 6.161130
6           2     down      3     23 6.583062
7           3     left      1      1 6.513467
9           3     left      3     24 5.199980
10          4       up      1     18 6.093998
12          4       up      3     20 7.628488
13          5   center      1     10 9.573291
14          5   center      2     33 9.156725
15          5   center      3     11 5.228851

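Note that the data.frame is unbalanced: unit 2 has no observation in the first period, and units 3 and 4 have none in the second. Also note that two variables vary across periods (counts and values) and two do not (identifier and location).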
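
One quick way to confirm which unit-period combinations are missing is to cross-tabulate the id variable against the time variable. The sketch below uses a small hand-built data frame, `toy`, which is a hypothetical stand-in for the first three units of df, so that it does not depend on the random draws above.

```r
# Hypothetical stand-in for df: unit 2 lacks period 1, unit 3 lacks period 2
toy <- data.frame(identifier = c(1, 1, 1, 2, 2, 3, 3),
                  period     = c(1, 2, 3, 2, 3, 1, 3))

# Cross-tabulate id by period; a 0 marks a missing observation
tab <- table(toy$identifier, toy$period)
tab

# List the missing (identifier, period) pairs by position in the table
gaps <- which(tab == 0, arr.ind = TRUE)
data.frame(identifier = rownames(tab)[gaps[, 1]],
           period     = colnames(tab)[gaps[, 2]])
```

A balanced panel would show a 1 in every cell of the table.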

Long to Wide

To reshape the data.frame to wide format:

# reshape wide on time variable
df.wide <- reshape(df, idvar="identifier", timevar="period",
                   v.names=c("values", "counts"), direction="wide")
df.wide
   identifier location values.1 counts.1 values.2 counts.2 values.3 counts.3
1           1       up 9.186478        4 6.431116       22 6.334104       22
5           2     down       NA       NA 6.161130       31 6.583062       23
7           3     left 6.513467        1       NA       NA 5.199980       24
10          4       up 6.093998       18       NA       NA 7.628488       20
13          5   center 9.573291       10 9.156725       33 5.228851       11

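Note that the missing time periods are filled with NA.

When reshaping to wide, the v.names argument specifies the columns that vary over time. If the location variable is not needed, it can be dropped before reshaping with the drop argument. Once the only non-varying, non-id column has been dropped from the data.frame, the v.names argument becomes unnecessary.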

reshape(df, idvar="identifier", timevar="period", direction="wide",
        drop="location")

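Wide to Long

To reshape the current df.wide back to long format, the minimal syntax is shown next. It works because reshape stores the wide specification as a "reshapeWide" attribute on its result.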

reshape(df.wide, direction="long")
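
As a minimal sketch of why the short call is enough, the round trip below inspects the "reshapeWide" attribute that reshape attaches to a wide result. The data frame `long` and the object names `wide` and `back` are hypothetical and kept tiny so the example runs on its own.

```r
# Hypothetical long data: two units, two periods, one time-varying column
long <- data.frame(identifier = c(1, 1, 2, 2),
                   period     = c(1, 2, 1, 2),
                   values     = c(9.2, 6.4, 6.2, 6.6))

wide <- reshape(long, idvar = "identifier", timevar = "period",
                direction = "wide")

# The wide result carries the specification needed to undo the reshape
names(attr(wide, "reshapeWide"))

# With that attribute present, no further arguments are required
back <- reshape(wide, direction = "long")
back
```

A wide data.frame that was not produced by reshape has no such attribute, so the varying argument has to be supplied for it.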

In practice, however, this is usually trickier:

# remove "." separator in df.wide names for counts and values
names(df.wide)[grep("\\.", names(df.wide))] <-
              gsub("\\.", "", names(df.wide)[grep("\\.", names(df.wide))])

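Now the simple syntax produces an error about undefined columns.

When column names are harder for reshape to parse automatically, it is sometimes necessary to add the varying argument, which tells reshape which wide-format variables to group together when converting to long format. This argument takes a list of vectors of variable names or indices.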

reshape(df.wide, idvar="identifier",
        varying=list(c(3,5,7), c(4,6,8)), direction="long")

When reshaping to long, the v.names argument can be supplied to rename the resulting time-varying variables.
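
A minimal sketch of combining varying with v.names, using a small hand-built wide data frame (`wide` and its values are hypothetical, chosen only to mimic the separator-free names created above):

```r
# Hypothetical wide data whose column names carry no separator
wide <- data.frame(identifier = 1:3,
                   values1 = c(9.1, NA, 6.5), counts1 = c(4, NA, 1),
                   values2 = c(6.4, 6.1, NA), counts2 = c(22, 31, NA))

# Group the columns by position and name the resulting long columns
long <- reshape(wide, idvar = "identifier",
                varying = list(c(2, 4), c(3, 5)),
                v.names = c("values", "counts"),
                direction = "long")
long
```

Without v.names, the long columns take the name of the first column in each group (here values1 and counts1), which is rarely what is wanted.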

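Sometimes the explicit grouping in varying can be avoided by using the sep argument, which tells reshape how to split each variable name into the part that names the variable and the part that gives the time.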
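
As a sketch of the sep approach, the example below passes varying as a plain vector of column positions and sets sep = "", which asks reshape to split each name where a letter is followed by a digit. The data frame `wide` is again a hypothetical stand-in with separator-free names.

```r
# Hypothetical wide data with names of the form <variable><time>
wide <- data.frame(identifier = 1:3,
                   values1 = c(9.1, NA, 6.5), counts1 = c(4, NA, 1),
                   values2 = c(6.4, 6.1, NA), counts2 = c(22, 31, NA))

# sep = "" lets reshape guess the variable names and times from the column names
long <- reshape(wide, idvar = "identifier", varying = 2:5,
                sep = "", direction = "long")
long
```

With names such as values_1, the same call with sep = "_" would be the natural choice.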