Generative dance soloists improvising in the 2-dimensional screen space, using Deep Learning _
This project is done as the final assignment for the course:
IS71074A / Data and Machine Learning for Artistic Practice (2018 – 19)
term II, MA/MFA in Computational Arts
Goldsmiths, University of London
Instructor: Dr. Rebecca Fiebrink
It is _
“Generative dance soloists improvising in the 2-dimensional screen space” are computer-generated sequences of movement, embodied by plain stick figures and collaged on top of photographic backgrounds, where the soloists can be viewed within a transcendental and poetic context. The system generating the movement is trained on video recordings of one solo dancer improvising in the humanly perceived 3-dimensional space. At its core is a deep recurrent neural network (RNN) trained on skeleton-tracking data (x and y coordinates of 13 body points), predicting new coordinates in sequence. “Generative dance soloists” can be seen as a dynamic visual artwork in its own right, or as a method of generating choreographic scores for further physical exploration by a choreographer.
Creative purpose _
The personal intention behind this process is to free myself (or oneself) from what is believed to be bodily possible or creatively desirable. I consider generative models a celebration of endless possibilities, possibilities I might not be aware of when choreographing or performing.
In my practice as a dance maker, I always aim for improvisational conditions where bodily instinct and momentum give birth to new material. But since the same conscious human body/mind always brings its own bias into the practice, computers might perform better at improvisation. The term “improvisation” invokes associations with such related notions as extemporisation, invention and origination (Carter, 2000), notions that strongly relate to what generative models deliver in response to their training data. Therefore, what I could not find in a systematic, preconceived choreographic process, I might discover in a deep learning generative process. I might find the stick figure flying, jumping to extreme heights, changing hip length, bending its waist by 270 degrees or, equally, avoiding stagnation.
The inclusion of a static background behind the kinetic generative output derives from my need to situate the soloists in a poetic land where human bodies could not conventionally dance. And with that, I complete what I speculate the generative model is doing: choreographic poetry.
Related work and references _
There is already an archive of earlier work in the field of generating realistic human motion (Yi Zhou et al., 2018; Holden et al., 2016; 2017; Luka Crnkovic-Friis et al., 2016; Fragkiadaki et al., 2015; Jain et al., 2016; Bütepage et al., 2017; Martinez et al., 2017), with applications in human-machine collaboration for choreography, animation, video games and more. Recurrent neural networks, and specifically LSTMs (long short-term memory networks), were the kind of networks I was searching for in my references.
The most recent work closely related to my creative approach was the project “Living Archive” by the choreographer Wayne McGregor (https://waynemcgregor.com/research/living-archive/), where I could directly witness virtual bodies dancing beautifully; this project became my visual template. At the same time, the research “Generative Choreography using Deep Learning” by Luka Crnkovic-Friis and Louise Crnkovic-Friis became my methodological template. In this paper, the researchers created a deep recurrent neural network (a mixture density LSTM) called chor-rnn (a play on chor-eography), which was trained on raw motion capture data (recorded by a Microsoft Kinect v2 sensor) and generated new dance sequences for a solo virtual dancer.
Recordings / Collecting data _
Following the path indicated by my references and tutor Rebecca Fiebrink, I started the project by building my dataset.
Although I was planning to use a personal archive of videos of solo dancers improvising, it proved too diverse in terms of the tracked bodies’ dimensions (actual height and distance from the camera) and the shooting angle. So, since I wanted control over my training dataset, I recorded myself improvising for 30 minutes under consistent spatial and recording conditions.
Afterwards, I could choose between importing the raw data of the video recordings (continuous frames/images) into my model, as this project did: https://github.com/jsn5/dancenet, or importing numeric values representing the skeleton-tracking coordinates, captured through poseNet and organised appropriately in a .csv file, similar to what this project did: https://github.com/keshavoct98/DANCING-AI. The first option would require many intermediary steps before my data was normalised and, most importantly, incomparably more training time and a strong GPU to support the process. Consequently, considering that this would be my first deep learning experience, I chose the second option:
I used poseNet (https://github.com/tensorflow/tfjs-models/tree/master/posenet) in order to skeleton-track my videos of solo dancing. PoseNet is a machine learning model which allows for real-time human pose estimation in the browser (https://medium.com/tensorflow/real-time-human-pose-estimation-in-the-browser-with-tensorflow-js-7dd0bc881cd5).
I modified the .js source code to read my already-recorded videos instead of the live camera stream, then wrapped poseNet inside electron and served the detected body coordinates via Max/Msp (https://github.com/yuichkun/n4m-posenet). Finally, I coded the relevant functions to pass the continuous poseNet values into a time-sequenced .csv file.
In the end, the final dataset was a .csv file (many .csv files while experimenting) consisting of one time-related column (values in milliseconds) and 26 columns of body coordinates (13 body joints * 2, for the x and y coordinates of the 2-dimensional tracking).
Body joints: LeftShoulder, RightShoulder, LeftElbow, RightElbow, LeftWrist, RightWrist, LeftHip, RightHip, LeftKnee, RightKnee, LeftAnkle, RightAnkle, Nose (representing the head in total).
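The resulting column layout can be sketched as follows (the column names here are illustrative, not necessarily the exact headers of my exported files):

```python
# Illustrative sketch of the dataset's column layout:
# one time column plus x/y coordinates for the 13 tracked joints.
JOINTS = [
    "leftShoulder", "rightShoulder", "leftElbow", "rightElbow",
    "leftWrist", "rightWrist", "leftHip", "rightHip",
    "leftKnee", "rightKnee", "leftAnkle", "rightAnkle", "nose",
]

columns = ["time_ms"] + [f"{joint}_{axis}" for joint in JOINTS for axis in ("x", "y")]

print(len(columns))  # 1 time column + 13 joints * 2 coordinates = 27
```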
Generative model _
RNN models would allow me to operate over sequences of vectors: sequences in the input, in the output or, in the most general case, in both. However, standard RNNs are difficult to train for long-term dependencies in a stable way. For such cases, an LSTM model is the solution, since it is stable over long training runs and can be stacked to form deeper networks without loss of stability (Andrej Karpathy’s blog post: http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
“LSTMs are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997) and were refined and popularised by many people in following work. They work tremendously well on a large variety of problems and are now widely used.” (Colah’s blog post: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
Therefore, with this supporting theory, LSTM would be the model to use in order to receive sequences of movements as input and generate sequences of movements as output. In my case, the sequences of movements are in the form of sequences of rows of values (x and y coordinates of body joints, in the way my dataset is structured).
I decided to use Google Colab as the free Jupyter notebook environment where I could write and execute my code entirely in the cloud, with access to powerful computing resources (https://colab.research.google.com/notebooks/welcome.ipynb).
Starting from an empty canvas and with no prior experience in the Python programming language (the one I would use in Google Colab), I browsed around GitHub in search of libraries and references related to my ideal LSTM model. After trying out various resources and combinations of them, I ended up with these three as my guide:
One single layer of a standard RNN model (upper img) vs four interacting layers of an LSTM model (lower img) from Colah’s blog.
Coding process _
So, I started coding my LSTM model in Python 3, in Google Colab:
I imported the needed libraries: numpy for arrays, pandas for dataframes, matplotlib.pyplot for plots, MinMaxScaler from sklearn for scaling each feature to a given range and therefore normalising data. Keras is a deep learning Python library running on top of TensorFlow. The model I used from Keras is the sequential model, a linear stack of layers. From Keras layers I imported Dense and LSTM.
I then connected google.colab to my google.drive in order to import the desired datasets from there.
I read the specific .csv file with the coordinates tracked from the video pose detection.
I read the shape of my dataset (rows, columns).
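The reading and shape-checking steps can be sketched as follows; the synthetic file `tracking_sample.csv` stands in for my actual recording, with 100 rows sampled at 20 rows/sec:

```python
import numpy as np
import pandas as pd

# Stand-in for the real recording: a synthetic .csv with one time column
# and 26 coordinate columns (13 joints * x/y), sampled 20 times/sec.
rng = np.random.default_rng(0)
frames = 100
data = np.column_stack([np.arange(frames) * 50,        # time in ms (20 fps)
                        rng.random((frames, 26))])     # joint coordinates
pd.DataFrame(data).to_csv("tracking_sample.csv", index=False)

# Read the dataset back and inspect its shape: (rows, columns).
dataset = pd.read_csv("tracking_sample.csv")
print(dataset.shape)  # (100, 27)
```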
I created a plot with all the features changing in time.
(The sample above corresponds to only one of the features: noseX)
I split my dataset in a training and a testing set.
I normalised and reshaped my data (adding one more dimension) as LSTMs are quite particular regarding the way in which the input time series data should be shaped.
I initialised my sequential model and added the desired layers. The number of units I choose defines the dimension of every inner time-step LSTM cell, as well as the output’s dimension. Tanh pushes the values to be between -1 and 1.
I fit my model to the training set and let the training begin . . .
The batch_size is the number of training examples in one forward/backward pass. The higher the batch size, the more memory space I’ll need.
One epoch is one forward pass and one backward pass of all the training examples. The number of passes (each pass using batch_size number of examples) is the number of iterations. Here, since I have 12000 training examples and my batch_size is 600, it will take 20 iterations to complete 1 epoch.
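The build-and-fit steps above can be sketched in Keras as follows. This is a minimal sketch, not my exact configuration: the layer sizes are illustrative, and tiny synthetic data replaces the real dataset (here 600 examples with batch_size = 60, so one epoch takes 10 iterations):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

# A stacked-LSTM sequential model mapping one time step of 26 coordinates
# to the next row of 26 coordinates.
model = Sequential([
    Input(shape=(1, 26)),                                  # (time steps, features)
    LSTM(128, activation="tanh", return_sequences=True),
    LSTM(128, activation="tanh"),
    Dense(26),                                             # one output per coordinate
])
model.compile(optimizer="adam", loss="mean_squared_error")

# Tiny synthetic data standing in for the real dataset.
rng = np.random.default_rng(2)
X = rng.random((600, 1, 26))
y = rng.random((600, 26))

# 600 examples / batch_size 60 = 10 iterations per epoch.
history = model.fit(X, y, batch_size=60, epochs=2, verbose=0)
print(len(history.history["loss"]))  # one loss value per epoch
```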
This step and the previous one are where all the algorithmic experimentation happens.
I read the shape of my test set and reshaped the test set for LSTM.
I created a predicted test set and I read its shape.
I generated my predictions.
I saved my predicted dataset in a .csv file.
I visualised the predicted values with a plot.
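The prediction-and-save steps can be sketched like this; `predicted_scaled` stands in for the model's output, and the filename `generated_sequence.csv` is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Fit a scaler on coordinate data in pixel range, standing in for the
# scaler fitted during preprocessing.
rng = np.random.default_rng(3)
original = rng.random((200, 26)) * 640
scaler = MinMaxScaler().fit(original)

# Stand-in for model.predict(...): 50 generated rows of scaled values.
predicted_scaled = rng.random((50, 26))

# Map the scaled predictions back into coordinate values.
predicted = scaler.inverse_transform(predicted_scaled)

# Save in the same 26-column layout as the training .csv.
pd.DataFrame(predicted).to_csv("generated_sequence.csv", index=False)
print(predicted.shape)  # (50, 26)
```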
I trained my model with 3 different datasets. I started with a small one, produced by a 25sec video. I then moved to a dataset produced by a 12.45min video and finally a 30min assemblage of videos. All of these were tracked for body joints’ coordinates 20 times/sec. For each dataset I played around with different numbers of LSTM units, batch_size and epochs.
Here I present my favourite generated results, one from each dataset:
From left to right:
Trained on 25sec (my very first successful output), LSTM units = 128, batch_size = 16, epochs = 200
Trained on 12.45min, LSTM units = 512, batch_size = 50, epochs = 600
Trained on 30min, LSTM units = 512, batch_size = 50, epochs = 600
After training my model and generating new sequences of x, y coordinates moving, I translated the generated .csv file into actual visual movement of a new stick figure. For this purpose, I continued coding in Python and Google Colab:
During the course’s lectures and labs, we were mostly taught supervised ML in Wekinator, but we were encouraged to experiment with whichever type and framework of ML we wished for our final project. Therefore, since my core idea was to generate movement, I turned my interest towards deep learning, and specifically deep recurrent neural networks and LSTMs. The process of first understanding this new field and then entering it was long, but it was worth it.
I can identify two stages as the most vital in the deep learning process I followed:
The first one includes the collection and shaping of my dataset. It was the most time-consuming stage; it demanded a big personal archive of videos and it relied significantly on another ML model’s (poseNet’s) accuracy. Commenting on that, I must admit that the exported .csv datasets were not 100% representative of the actual body joints’ coordinates. Not even close. PoseNet never managed to track legs lifted up in the air. Whenever a lower joint (hip, knee, ankle) was raised over 45 degrees, poseNet interpreted it as an upper joint (shoulder, elbow, wrist) and broke the correspondence between the moving body and the moving skeleton. Moreover, the fact that poseNet does 2-dimensional tracking resulted in some interesting but unrealistic outputs. The skeleton grew bigger and smaller, which could be read as a body moving forward and backward or as a body changing scale. Relatedly, poseNet could barely understand a body kneeling on the floor or lying down (the figure never became horizontal). At moments like these, it created either a tiny little figure under the tracked head of the body or a complex non-human-like shape.
The second essential stage includes the actual training of my model. Since I was driven by curiosity and a playful attitude rather than a systematic methodological plan, I ended up with quite random results that do not really support valid observations, apart from the general induction that bigger batches and more epochs usually lead to a better-trained model and a kinetically richer output, if not an overfitted one. In my future trials, I plan to test the same training conditions (LSTM units, batch_size and epochs) on different datasets, and different conditions on the same dataset, in a much more consistent way. I also plan to use a much bigger archive of videos, both of myself and of others, normalised accordingly. I suspect that most of my generated kinetic variations tend to overfit the training data because of the limited archive/dataset I provided and the large number of epochs I used. Having a bigger dataset and using fewer epochs would probably result in a more generative, computed-improvisational output.
Overall, I am extremely pleased with the results I exported. The sequences of movements are quite vague and unearthly, most of the time resembling a state of floating, in such a poetic way that I am convinced I am witnessing a new kind of dance improvisation. And as I’ve mentioned before, I chose to highlight this even more by adding photographic backgrounds of nature, where humans would not be expected to appear moving. These photographs belong to Stathis Doganis’ archive.
“What my body can do is limited. This is not a bad thing because how I choreograph frees me from those limitations. Writing is then how I reframe and understand the body through my choreography.”
Altering slightly Deborah Hay’s approach, I could claim that what my body can do is limited but currently, generative ML models free me from those limitations. Coding could be a way to reframe and understand the body inside choreography.
References _
– Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5308–5317, 2016.
– Curtis L. Carter. Improvisation in Dance. The Journal of Aesthetics and Art Criticism, Vol. 58, No. 2, Improvisation in the Arts, Spring 2000.
– Daniel Holden, Jun Saito, and Taku Komura. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG), 35(4):138, 2016.
– Daniel Holden, Taku Komura, and Jun Saito. Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG), 36(4):42, 2017.
– Judith Bütepage, Michael Black, Danica Kragic, and Hedvig Kjellström. Deep representation learning for human motion prediction and classification. arXiv preprint arXiv:1702.07486, 2017.
– Julieta Martinez, Michael J Black, and Javier Romero. On human motion prediction using recurrent neural networks. arXiv preprint arXiv:1705.02445, 2017.
– Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4346–4354, 2015.
– Luka Crnkovic-Friis and Louise Crnkovic-Friis. Generative choreography using deep learning. 7th International Conference on Computational Creativity, ICCC 2016.
– Yi Zhou, Zimo Li, Shuangjiu Xiao, Chong He, Zeng Huang and Hao Li. Auto-conditioned recurrent neural networks for extended complex human motion synthesis. ICLR. arXiv:1707.05363v5 [cs.LG], 2018.