I'm often testing forecasting models, and what this ends up creating is a bunch of "forecasted" variables that are paired with the "actual" values. R has fabulous faceting capabilities, and I have often wanted to reshape the data in a way where the category of forecasted variable as an identifier, and then two columns that list the forecasted and actual variables. In other words, if the code starts from something like
aAct aPred bAct bPred id 1 1.2076384 -0.6735547 1.4994464 -1.0691975 1 2 0.4999706 -0.7188215 -0.3601551 0.7224729 2 3 1.0340859 -0.1108304 -0.5941295 0.5027085 3And I want to convert it where one column has an id, another one identifies whether I'm forecasting a or b, and a third column that has the forecasted value, and then a fourth column with the actual value.
The procedure in R involves "melting" the data frame and then "casting" it. Melting is rather simple -- you provide a set of identifiers, and then the data frame is melted down to only that identifier, the values, and another indicator variable that tells you what the value is supposed to represent. In the above example, if we let df be the data frame described above, I would run:
df.m = melt(df, id.vars = 'id')
id variable value
1 1 aAct 1.2076384
2 2 aAct 0.4999706
3 3 aAct 1.0340859
4 1 aPred -0.6735547
5 2 aPred -0.7188215
6 3 aPred -0.1108304
7 1 bAct 1.4994464
8 2 bAct -0.3601551
9 3 bAct -0.5941295
10 1 bPred -1.0691975
11 2 bPred 0.7224729
12 3 bPred 0.5027085
Now I need to "unmelt" part of the data frame to get the forecast/actual pairings. In R, this is known as casting and I know that I personally had a pretty hard time decoding the documentation. The function goes along ascast(df.m, <IDENTIFIERS> ~ <VALUES>)
The second part is known as the casting formula and is the part that I have struggled with. But in its most simplest form, the casted frame will look like something with all the identifiers added together as uniquely identifying units ,and then the <VALUES> variables being the labels for the actual value column. If that sounded confusing, I apologize. Perhaps solving the example would help.
First, I need to find a way to identify whether a row is looking at a or b, and whether it is a forecast or an actual variable. So I first create these variables:
df.m$var = substring(df.m$variable, 1, 1)
df.m$type = substring(df.m$variable, 2)
Which gives me the data frame:
> df.m
id variable value type var
1 1 aAct 1.2076384 Act a
2 2 aAct 0.4999706 Act a
3 3 aAct 1.0340859 Act a
4 1 aPred -0.6735547 Pred a
5 2 aPred -0.7188215 Pred a
6 3 aPred -0.1108304 Pred a
7 1 bAct 1.4994464 Act b
8 2 bAct -0.3601551 Act b
9 3 bAct -0.5941295 Act b
10 1 bPred -1.0691975 Pred b
11 2 bPred 0.7224729 Pred b
12 3 bPred 0.5027085 Pred b
Now I can cast the frame. In this case, I would use the formula
df.mc = cast(df.m, id + var ~ type)
This is how you interpret the formula. Id + var means that every observation is uniquely identified by it's id code and the variable we're forecasting -- a or b. Then "type" on the right side represents the new variable names that will be filled by the values.
Hope this is useful to others so they don't end up spending hours agonizing over the issue as did I.
If you are going to be doing a lot of forecasting like this, I recommend taking a look at the pandas library for python. It is much faster than R for data frame work.
ReplyDeletehttps://duckduckgo.com/?q=R+versus+python+pandas+speed
You also get the benefit of beautiful pythonic code over R soup.