---
title: "irtrees 1.0.0"
author: "Zhaojun Li, Ivailo Partchev and Paul De Boeck"
date: "December 6, 2021"
bibliography: irtrees.bib
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{irtrees 1.0.0}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[UTF-8]{inputenc}  
---

```{r setup, echo=FALSE,message=FALSE,warning=FALSE}
knitr::opts_chunk$set(collapse = T, comment = "#>")
options(tibble.print_min = 4L, tibble.print_max = 4L)
set.seed(123)
library(irtrees)
library(flextable)
library(reshape2)
library(mirt)
library(lme4)
```

The first version of package __irtrees__ contained just two functions, dendrify() and exogenify() (preserved in the current version for compatibility), and some data sets permitting the user to replicate the examples in @deBoeck2012. The main interest lied in the psychometric methodology introduced by the paper. Nine years later almost to the date, IRTrees models have become quite popular, with applications to many aspects of human and animal behavior. An update of the package was overdue, and it includes several substantial improvements.

Back in 2012, the principal way to estimate an IRTrees model was with the glmer() function in package __lme4__ [@lme4]. Hence, we needed to translate all data sets to the long data shape expected by that package. Another constraint was that all items in the test were expected to follow the same IRTrees model. In the meanwhile, more flexible IRTrees models have been introduced, and the range of possibilities to estimate them has expanded to include __mirt__ [@mirt], __flirt__ [@flirt],  but also __TAM__, __STAN__, __MPlus__, $\ldots$ the list can certainly be continued. Each of these programs has different demands on the shape of input data, and they also tend to offer differential advantages. This new version of __irtrees__ strives to facilitate IRTrees modeling with different software and for a wider range of models. 

First and foremost, a set of new functions written by Zhaojun Li allow users to prepare their data sets in a much more flexible way:

* flexible category-by-node mapping matrices that permit both dichotomous nodes and polytomous nodes
* a possibility to use different category-by-node mapping matrices across items 
* item/person-specific covariates and higher-level covariates (e.g., school, country) when data sets are multilevel
* repeated measures as in longitudinal studies. 

Both the original data sets and the generated IRTrees data sets may be either wide-format or long-format. This allows for extra flexibility in the choice of software for the actual estimation of the model parameters. 

Another addition is a function that allows users to draw the tree for a specific item and have it automatically translated to a mapping matrix. We see two advantages in this: first, the graphical output may be useful for publication; second, even if setting up a mapping matrix is relatively simple, there will always be persons who think better in terms of matrices and those feeling more at ease with graphs.    

## Wide and long format

As mentioned above, some of the software packages that can fit IRTrees models expect to see the data in long format, while others require wide format. Thanks to the __tidyverse__ movement [@tidyverse], awareness of long format has increased in recent times. Until not too long ago, a statistician could reach retirement without ever seeing a data set in long format. After all, wide format, the sample-by-variable matrix, is the native data representation in statistics. It is predominant in textbooks, easily understood, it readily yields summary statistics and conveniently highlights the structure of missing data (if any). Two related matrices, the sample-by-sample dissimilarity matrix and the variable-by-variable covariance (or correlation) matrix, provide the input for a plethora of statistical methods. A very minimal example of a data set in wide format is shown below.

```{r, echo=FALSE}
g = data.frame(person=c('John','Mary'),
               sex=c('male','female'),
               item1=c("A","A"),
               item2=c("C",'B'))
flextable(g)
```

From the point of view of the professionals in data management, the wide format is quite insane because the vessel containing the data must be remade from scratch with each problem: both the number of the columns and their meaning are redefined. On the other hand, _any_ problem can be accommodated by a single data structure with three columns whose meaning is fixed once and for all: sample, variable, and value:

```{r, echo=FALSE, warning=FALSE}
flextable(melt(g,id.vars='person'))
```

Whether the test has 2, 20, or 200 items is now a data problem, not a data structure problem. Missing values do not have to be stored at all, which can be very efficient with sparse data; on the other hand, we cannot observe directly any potentially revealing patterns of missingness. Computing covariance or distance matrices from data in long format is not as easy as compared to wide format. Long before we come to IRTrees models, we observe that the two data representations have their relative merits and disadvantages!

The data format in which we are actually interested, and which we will call "long" in this vignette, is somewhat different:

```{r, echo=FALSE, warning=FALSE}
flextable(melt(g,id.vars=c('person','sex')))
```

Note how the person property, in this case sex, is replicated with each response. This is because the data is actually multi-level. The basic unit of interest is the response, resulting from the interaction of an individual with an item. Responses are nested within persons, which is why we need to replicate all person properties. The same would be true of any item properties (note that we would not be able to include these directly in wide format, as they pertain to the columns). A huge advantage of this format from the statistical point of view is that the item side and the person side are perfectly symmetric, which makes it easy to include both item and person covariates, and possibly their interactions, in the model. This is not something that would preoccupy data management people: they would simply place the data for each level in separate tables related by key variables:

```{r, echo=FALSE, warning=FALSE}
flextable(melt(g[,-2],id.vars=c('person')))
flextable(g[,1:2])
```

## Eight new functions

Eight new functions can be used to prepare data originally in wide or long format for IRTrees modelling in wide or long format, and with one or multiple trees involved. These are summarized in the following table:  


```{r, echo=FALSE}
f=expand.grid(c('Wide','Long'),c('Wide','Long'),c('single','multiple'))
names(f)=c('Input','Output','Tree')
f$Function = paste0(substr(f$Input,1,1),'to',substr(f$Output,1,1),'_',f$Tree,'.tree')
f$Function = gsub('multiple','multi',f$Function)
autofit(flextable(f))
```

We are not going to discuss each function in full detail. After one more detailed example, we explain how and why the functions differ, and we provide examples where needed.

## Simple example: wide input, wide output, all items share the same tree

As a first example, consider the following dummy data set:

```{r}
five_wide = data.frame(
  id = 1:100,
  gender = factor(sample(c('M','F'), 100, replace=TRUE)),
  item1 = sample.int(5, 100, replace = TRUE),
  item2 = sample.int(5, 100, replace = TRUE),
  item3 = sample.int(5, 100, replace = TRUE),
  item4 = sample.int(5, 100, replace = TRUE),
  item5 = sample.int(5, 100, replace = TRUE)
  )
str(five_wide)
```

There are 5 five-point items (not very good ones, but we are not going to analyze this data anyway, just show how to prepare it for analysis); a person ID variable; and a person-specific covariate, gender. Item categories 1 through 5 represent 'Strongly disagree', 'Disagree', 'Neither agree or disagree', 'Agree', and 'Strongly Agree', respectively. All items are assumed to follow the same IRTrees model which can be represented as follows: 

```{r, echo=FALSE}
library(DiagrammeR)
grstr = "digraph linear {
            node [shape=oval]
              Y1; Y2; Y3; Y4
            node [shape=box]
              Neutral; 'Strongly disagree'; Disagree; Agree; 'Strongly agree'
            edge [label='1']
              Y1->Neutral; Y2->Y4; Y3->'Strongly disagree'; Y4->'Strongly agree'
            edge[label='0']
              Y1->Y2; Y2->Y3; Y3->Disagree; Y4->Agree;
          }"
tree=grViz(grstr)
tree
```

We have borrowed this tree from @extreme. It provides a good example of an IRTrees model and of the 'pseudo-item' approach to fitting it. According to the hypothesized tree structure, there are four binary internal nodes of which the first, Y1, determines whether the response is a midpoint response (category 3) or any of other responses. If not midpoint, a second node, Y2, determines whether the response is positive (category 4 or 5) or negative (category 1 or 2). Given the direction of the response, two further nodes, Y3 and Y4, determine whether the response is extreme or not (category 1 vs. 2 if negative, and category 5 vs. 4 if positive).

From the tree, it is not difficult to deduce the mapping matrix:

```{r, echo=F}
five_cmx = matrix(c(0,  0,  1,  0,  0,
                    0,  0,  NA, 1,	1,
                    1,  0,  NA, NA, NA, 
                    NA, NA, NA, 0,  1), nrow = 5)

cats = c('Strongly disagree', 'Disagree', 'Neither agree or disagree', 'Agree', 'Strongly Agree')
z=cbind(cats,as.data.frame(five_cmx))
names(z)=c(' ',paste0('Y',1:4))
flextable(z) |> colformat_num(na_str = "NA") |> bg(j=grep("Y", colnames(z), value = TRUE), bg='skyblue', part='body')
```

Each column represents a node. Each row corresponds to a response category, and contains the coefficients received from each node. For example, the fourth category, Agree, has:

* a sub-response of $0$ at the first node because it is not a midpoint;
* a sub-response of $1$ at the second node because it is a positive response;
* a sub-response of $NA$ at the third node because this node is not applicable: it only discriminates categories with a negative attitude (i.e., categories 1 and 2); and
* a sub-response of $0$ at the fourth node because it is not an extreme response.

Having the mapping matrix, we are ready to recode the data set with function WtoW_single.tree():

```{r}
five_wide_single.tree = WtoW_single.tree(
  data = five_wide, 
  cmx = five_cmx, 
  id.col = 1, 
  resp.col = c(3:7), 
  covar.col = 2
  )
str(five_wide_single.tree)
```

In this call, 

* `data` is the (wide form) data set to be recoded,
* `cmx` is the category-by-node mapping matrix,
* `id.col` is the column number of the person ID variable,
* `resp.col` is a vector of the columns of responses,
* `covar.col` is the column(s) of covariates, if existing.

Column names may be used instead of column numbers:

```{r, eval=FALSE}
five_wide_single.tree = WtoW_single.tree(
  data = five_wide, 
  cmx = five_cmx, 
  id.col = "id", 
  resp.col = c("item1", "item2", "item3"), 
  covar.col = "gender"
)
```

In a longitudinal survey where the same test is taken by the same person more than once, data in wide format can be stacked together with an additional variable identifying the test occasion; this must be passed to the function with the additional argument, `time.col`. 

## Variation: wide input, long output, all items share the same tree 

The wide output file would be fine if we wanted to estimate the IRTrees model with a program like __mirt__, but what if we prefer to use __lme4__ as in the original JSS paper? All we need to do is change the function; the input is the same, and so are all parameters:

```{r}
five_long_single.tree = WtoL_single.tree(
  data = five_wide, 
  cmx = five_cmx, 
  id.col = "id", 
  resp.col = c("item1","item2","item3"), 
  covar.col = "gender")

str(five_long_single.tree)
```

## Complication: wide input, items may have different trees

The original version of __irtrees__ assumed that all items have the same number of categories governed by the same tree. To relax this rather severe limitation, we need to replace the mapping matrix with a list of mapping matrices, and provide another list of the same length specifying the items to which each of the matrices will be applied. All other parameters of the function remain unchanged.

As an example, let us assume a test of five items and two different trees specified with three mapping matrices, `mx1`, and `mx2`. The first matrix applies to items 1, 3, and 5, and the second matrix applies to items 2 and 4. We already have one matrix, `five_cmx`, and let the other one be:

```{r, echo=FALSE}
five_cmx.2 = matrix(c(0,  0,  1,  0,  0,
                      0,  1, NA,  2,	3), nrow = 5)
five_cmx.2
```

We then need to specify:

```{r, eval=FALSE}
matrices_list = list(five_cmx, five_cmx.2)

items_list = list(c('item1', 'item3', 'item5'), c('item2', 'item4'))

five_wide_multiple.tree = WtoW_multiple.tree(
  data = five_wide, 
  cmx_list = matrices_list,
  resp.col_list = items_list,
  id.col = "id", 
  covar.col = "gender"
)
```

To produce an output data set in long rather than wide shape, just replace WtoW_multiple.tree() with WtoL_multiple.tree(). 

## What if the input data is in long shape?

Test data may be put in long shape as a matter of preference, but it may also originate like this. In older times, when all tests were paper-based and results had to be typed in manually or, with some luck, scanned with an optical reader, wide shape was ubiquitous. With computer-based testing becoming widespread, data is much more likely to arrive as person-item-response triplets. This is handy if we want to fit our models in __lme4__: in particular, it is equally easy to include person covariates, item covariates, or interactions of both. 

To illustrate, we will use a much-analyzed data set that was also the subject of our previous paper [@deBoeck2012]. It concerns the self-reported verbal reactions to four different embarrassing situations. Person covariates include gender and trait anger. Item covariates include the situation, the type of reaction ('curse', 'scold', or 'shout'), and its mode  ('actually do' vs. 'want to do'), while the possible responses for each combination are 'no', 'perhaps', or 'yes'. One of the interesting findings with this data set is that, other things equal, women 'want to' react verbally as often as men do, but 'actually do' react less frequently. This is a hypothesis about an interaction between a person property and an item property. 

There may be many good reasons to perform an analysis with __mirt__ instead: faster estimation of fixed parameters, the possibility to use the 2PL model or perhaps an unfolding model, etc. However, not only does __mirt__ expect the data in an $n\times p$ shape, with $n$ the number of persons and $p$ the number of items; any person properties to be used as covariates must be supplied in a separate data frame with $n$ rows, while the item properties, if any, must go into yet another data frame with with $p$ rows. While the person properties will be found in the first columns of the IRTrees wide data set we are going to generate, any item covariates are simply discarded and the corresponding data frame must be constructed by hand. 

The data set is bundled with the __lme4__ package. Currently __irtrees__ expects the item categories to be represented with consecutive integers starting with 1, so we start by transforming the factor holding the responses to a vector of integers:  

```{r}
VerbAgg$resp = as.integer(VerbAgg$resp)
```

As in [@deBoeck2012], we assume a linear tree that expresses the idea of an ordinal item:

```{r, echo=FALSE}
lintree = 'graph TB
  X1((X1))  --0-->  L1[no]
  X1((X1))  --1-->  X2((X2))
  X2((X2))  --0-->  L2[perhaps]
  X2((X2))  --1-->  L3[yes]'
DiagrammeR::mermaid(lintree)

lincmx = graph2mx(lintree)

cats = c('no', 'perhaps', 'yes')
z=cbind(cats,as.data.frame(lincmx))
names(z)=c(' ',paste0('X',1:2))
flextable(z) |> colformat_num(na_str = "NA") |> bg(j=grep("X", colnames(z), value = TRUE), bg='skyblue', part='body')
```

For the verbal aggression example, with the mapping matrix in `lincmx`, transforming the long data set to wide starts with:

```{r}
VerbAgg$resp = as.integer(VerbAgg$resp)

VerbAgg_wide = LtoW_single.tree(
  data = VerbAgg, 
  cmx = lincmx, 
  id.col = "id",
  item.col = "item", 
  resp.col = "resp", 
  covar.col = c("Anger","Gender","btype","situ","mode"))

str(VerbAgg_wide)
```

The first thing to do after that is make sure that the order of items in the wide data set is as expected; otherwise we might end up specifying the wrong model. Next, we construct the person covariates data frame and the item covariates data set; following the __mirt__ conventions, we will call these `covdata` and `itemdesign`. Note that the first one can be used with both the mirt() and the mixedmirt() functions, while the second one is only available with mixedmirt().

```{r}
covdata = VerbAgg_wide[,2:3]
VerbAgg_wide = VerbAgg_wide[,-(1:3)]
itemdesign = data.frame(node = factor(rep(1:2, each=24)))
itemdesign$mode = factor(ifelse(grepl('Do', names(VerbAgg_wide)), 'Do', 'Want'))
```

We will also make a long data set in order to run both __lme4__ and __mirt__:

```{r}
VerbAgg_long = LtoL_single.tree(
  data = VerbAgg, 
  cmx = lincmx, 
  id.col = "id",
  item.col = "item", 
  resp.col = "resp", 
  covar.col = c("Anger","Gender","btype","situ","mode"))

str(VerbAgg_long)
```

The adjustments for situations with more than one tree are similar in logic to those for wide shape data: supply a list of matrices and a list of items to which they apply. 

## A complete example

One of the hypotheses tested in @deBoeck2012 had to do with the ordinality of the items. This can be accomplished by comparing the goodness of fit of the one-dimensional model (a common latent trait for both nodes) with the two-dimensional model (a different latent trait per node). In __lme4__, the two models would be:

```{r, eval=FALSE}
model1 = glmer(resp ~ 0 + node:item + (1 | id),        family=binomial, data=VerbAgg_long)
model2 = glmer(resp ~ 0 + node:item + (0 + node | id), family=binomial, data=VerbAgg_long)
```

48 or 96 fixed parameters may take quite a long time in __lme4__, so we might prefer to treat the items as random:

```{r}
model1 = glmer(resp ~ 0 + node + (0 + node | item) + (1 | id),        family=binomial, data=VerbAgg_long)
model2 = glmer(resp ~ 0 + node + (0 + node | item) + (0 + node | id), family=binomial, data=VerbAgg_long)
```

The difference between the two models is in the random effects for the persons: the one-dimensional model has `(1 | id)` while the two-dimensional model has `(0 + node | id)`. Both models have a fixed effect of `node` -- in the second model it provides the means of the two correlated latent variables.

In __mirt__, we define the univariate and the multivariate model more explicitly:

```{r}
mm1 = "F1=1-48"

mm2 = "F1=1-24 
       F2=25-48 
       COV=F1*F2"
```

__mirt__ has no fear of a large number of fixed parameters -- estimation is quite fast, and the code would be something like:

```{r, eval=FALSE}
mirt1 =  mirt(data = VerbAgg_wide, model = mm1, itemtype="Rasch", verbose=FALSE)
mirt2 =  mirt(data = VerbAgg_wide, model = mm2, itemtype="Rasch", verbose=FALSE)
```

However, we will use mixedmirt() for better comparability with whatever we did with __lme4__. With mixedmirt(), we can make the item parameters random, and we can include the effect of `node`. The code becomes:

```{r}
mirt1 =  mixedmirt(data=VerbAgg_wide, model=mm1, fixed = ~ 0 + node, random= ~ 1 | items, 
                   itemdesign=itemdesign, SE=TRUE, verbose=FALSE)
mirt2 =  mixedmirt(data=VerbAgg_wide, model=mm2, fixed = ~ 0 + node, random= ~ 1 | items, 
                   itemdesign=itemdesign, SE=TRUE, verbose=FALSE)
```

The likelihoods seem to differ between __lme4__ and __mirt__ but the general message is the same:

```{r}
anova(model1, model2)
anova(mirt1, mirt2)
```


## Specifying mapping matrices in mermaid

When we deal with tree models, publication is usually on our mind, and we can do with some nice tree diagrams. One handy way to produce some is with __mermaid__, which is a javaScript library equipped with its own easy language to describe graphs. __mermaid__ supports many types of diagrams; the one we need is called 'flowchart' in their nomenclature. The easiest way to produce a tree diagram in __mermaid__ is with their [live editor](https://mermaid-js.github.io/mermaid-live-editor/). On the left, replace their own example with

```
graph TB
  X1  --0-->  no
  X1  --1-->  X2
  X2  --0-->  perhaps
  X2  --1-->  yes
```

As you type, you will see the tree emerge on the right. When you are done, click on 'COPY IMAGE TO CLIPBOARD' and paste the diagram into your Word or LibreOffice document. Further buttons export the graph in a raster format (PNG) or a vector format (SVG).

__mermaid__ is supported in __R__ with package __DiagrammeR__ [@DiagrammeR], which is loaded automatically by __irtrees__. When authoring in RStudio, the graph can be displayed in the following way:

```{r}
mytree = 'graph TB
  X1  --0-->  no
  X1  --1-->  X2
  X2  --0-->  perhaps
  X2  --1-->  yes'

DiagrammeR::mermaid(mytree)
```

This works well if the output is an HTML file; to make it also work with PDF output, install __webshot__ with

```
install.packages("webshot")
webshot::install_phantomjs()
```

Although it is quite easy to derive the mapping matrix from the tree, we thought it would be nice to have a function that could do this for us automatically. This is fully possible within the __mermaid__ language, but we need to add some conventions of our own. __mermaid__ distinguishes between a node name and a node label, and so far we have only used node names. To be able to derive the matrix, we will play with the names and the labels in the following way:

* We will use node names X1, X2, X3, $\ldots$ for the latent nodes and L1, L2, L3, $\ldots$ (think of 'leaves', the name for the terminal nodes of a tree)
* We will equip all latent nodes with circular labels repeating the node name like ((X1)); and all leaves with rectangular labels and any arbitrary text, for example [I think I disagree]
* Although it is sufficient to specify the node label only at the first occurrence of the node, we will repeat them each time; the reasons for this will be explained below.

Following these additional rules, our tree description string becomes

```{r}
mytree = 'graph TB
  X1((X1))  --0-->  L1[no]
  X1((X1))  --1-->  X2((X2))
  X2((X2))  --0-->  L2[perhaps]
  X2((X2))  --1-->  L3[yes]'
```

and would produce the same graphical output as before, but now we can get the mapping matrix with:

```{r}
mmx = graph2mx(mytree)
mmx
```

Why repeat the labels? Sometimes we want to cheat. Consider the response styles tree from @extreme that we examined at be beginning of this vignette and reproduce now in __mermaid__:

```{r}
tree1 = 'graph TB
  X1((X1)) --1--> L1[Neutral]
  X1((X1)) --0--> X2((X2))
  X2((X2)) --0--> X3((X3))
  X2((X2)) --1--> X4((X4))
  X3((X3)) --1--> L2[Strongly disagree]
  X3((X3)) --0--> L3[Disagree]        
  X4((X4)) --0--> L4[Agree]
  X4((X4)) --1--> L5[Strongly agree]'

DiagrammeR::mermaid(tree1)
```

```{r}
graph2mx(tree1)
```

We have, in X2, the target trait (positive or negative opinion), and, in X3 and X4, two extreme style nodes: one for positive extreme, and one for negative extreme responses. What if we are happy with just one extreme style variable, irrespective of positive or negative? We cannot just merge X3 and X4 like this:

```{r, echo=FALSE}
mytree = 'graph TD
  X1((X1)) --1--> L1[Neutral]
  X1((X1)) --0--> X2((X2))
  X2((X2)) --0--> X3((X3))
  X2((X2)) --1--> X3((X3))
  X3((X3)) --1--> L2[Strongly disagree]
  X3((X3)) --0--> L3[Disagree]        
  X3((X3)) --0--> L4[Agree]
  X3((X3)) --1--> L5[Strongly agree]'

DiagrammeR::mermaid(mytree)
```

because we will mess up the edges to the leaves. But, cheating a bit with the node labels, we can get the right mapping matrix:

```{r, echo=FALSE}
tree2 = 'graph TD
  X1((X1)) --1--> L1[Neutral]
  X1((X1)) --0--> X2((X2))
  X2((X2)) --0--> X3((X3))
  X2((X2)) --1--> X4((X3))
  X3((X3)) --1--> L2[Strongly disagree]
  X3((X3)) --0--> L3[Disagree]        
  X4((X3)) --0--> L4[Agree]
  X4((X3)) --1--> L5[Strongly agree]'

DiagrammeR::mermaid(tree2)
```

```{r}
mmx = graph2mx(tree2)
mmx
```


## References