Beat the Big Bit

quick ideas

L1 trend filtering in Python

For the last couple of weeks I’ve been diving into the world of convex optimization by taking the amazing Stephen Boyd’s course. One of the things that had been troubling me pretty much since I started using techniques such as Robust PCA or SVMs is how exactly you optimize such non-differentiable cost functions.

So, I took the course and then implemented a method that I’d been wanting to use in Python but for which I found no actual implementation. The method is called L1 trend filtering, and it does the following: given a time series such as the one below, it finds a way to describe it as a piecewise linear function.

[image: an example time series]

By piecewise linear function I mean a function that is a juxtaposition of linear segments. The way the method works is amazingly simple, yet really powerful.

Suppose our time series is defined as \( x(t) \), so that \( x(1) \) is our first observation, \( x(2) \) the second one, and so on. Our goal is to find a \( y(t) \) that is piecewise linear.

Since \( y(t) \) is piecewise linear, its first derivative is constant on each stretch where the function is linear, so the second derivative is zero there. Yes, I know what you’re thinking: the derivative is not actually defined!!

OK, since we are in the discrete world, we can define the first derivative of \( y \) as \[ y'(t) = y(t) - y(t-1) \] and the second derivative as \[ y''(t) = y'(t) - y'(t-1) = y(t) - 2 y(t-1) + y(t-2) \]

Taking that into consideration, we can solve the following optimization problem:

\[ \min_y \sum_t (x(t) - y(t))^2 + \delta \sum_t |y''(t)| \]

The first term is a quadratic error term; it basically means “stick close to the original time series”. The second term penalizes the solution’s second derivative. Thanks to the super powers of the L1 norm (a.k.a. sum of absolute values), the solution will have very few points where the second derivative is non-zero, so voilà!

[image: a time series; the X axis is time, the Y axis the value of some observed phenomenon over time]

Here are two different solutions to the problem, obtained by changing the value of \( \delta \). This parameter controls the tradeoff between how close \( y(t) \) stays to \( x(t) \) and how many inflection points it has.

Usually in the literature this parameter is written \( \lambda \), but lambda is a Python keyword, so I used \( \delta \) instead :)
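Here’s a minimal sketch of how the problem above can be written in Python (assuming the cvxpy library; the function name and defaults are illustrative, not the exact original code):

import numpy as np
import cvxpy as cp

def l1_trend_filter(x, delta=1.0):
    """Fit a piecewise linear y to the series x by penalizing the L1
    norm of the discrete second derivative of y."""
    x = np.asarray(x, dtype=float)
    y = cp.Variable(len(x))
    fit = cp.sum_squares(x - y)           # sum_t (x(t) - y(t))^2
    smoothness = cp.norm1(cp.diff(y, 2))  # sum_t |y''(t)|
    cp.Problem(cp.Minimize(fit + delta * smoothness)).solve()
    return y.value

A larger delta gives fewer inflection points; a smaller one hugs the data more closely.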

Since this technique is so robust, we can also use it to remove outliers. Take a look:

[image: the noisy series, the L1 trend fit, and the cleaned series]

I added some noise to the series, then fitted the L1 trend filter and removed everything that lies too far from the fit, which leads to the green line.
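A sketch of that cleanup step, reusing l1_trend_filter from above (the three-sigma threshold is an illustrative choice):

import numpy as np

def remove_outliers(x, delta=1.0, n_sigmas=3.0):
    """Drop the points that lie too far from the L1 trend fit."""
    x = np.asarray(x, dtype=float)
    residuals = x - l1_trend_filter(x, delta)
    keep = np.abs(residuals) < n_sigmas * residuals.std()
    return x[keep]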

Just in case you missed it above, the code is in the two sketches up there.

Hope it’s useful!

Safely opening files

It’s been a long time since I last updated my blog. So, instead of writing very polished, trying-to-be-well-written posts, I’ll just write what I have in mind.

Today I needed to write a file with the result of a very long process that is divided into many, many small sub-processes. So, instead of waiting for it all to finish, I decided to write each sub-result to the file as soon as I had it.

While writing this sort of custom code, I noticed a pattern that I’d like to share.

Here’s the code of a Python context manager for doing safe writes to a file, so that if an exception is raised within the context, the file does not get wiped out. What follows is a sketch of the pattern (it assumes POSIX rename semantics):
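import os
import tempfile
from contextlib import contextmanager

@contextmanager
def safe_open(path, mode='w'):
    """Write to a temporary file and only replace the target on a clean
    exit, so an exception inside the context leaves the original intact."""
    # keep the temp file in the same directory so the final rename is atomic
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    f = os.fdopen(fd, mode)
    try:
        yield f
        f.close()
        os.rename(tmp_path, path)  # commit the new contents
    except Exception:
        f.close()
        os.remove(tmp_path)  # roll back: discard the partial write
        raise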

So, if you do something like the following (a hypothetical run):
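# hypothetical run: the exception aborts the commit, so delete_me
# keeps its previous contents
with open('delete_me', 'w') as f:
    f.write('precious contents\n')

try:
    with safe_open('delete_me') as f:
        f.write('half-finished output...\n')
        raise RuntimeError('boom')
except RuntimeError:
    pass

print(open('delete_me').read())  # still prints 'precious contents'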

The delete_me file contents are not deleted.

Hope it’s helpful!

Sometimes libs happen

Some days ago I faced the problem of deploying OpenCV and its Python wrapper to Heroku.

At first sight, it seems a pretty easy task, just a combination of “pip install”s and “make install” should work… as you may guess, this was not the case.

First, since OpenCV’s main Python wrapper cannot be installed with pip, we need to add a buildpack to our configuration. This buildpack downloads the OpenCV package, compiles it and installs it. To my surprise, it also has to download, compile and install cmake.

So, once I understood how to add the buildpack, I did the magic Heroku deploy (which translates to git push heroku master), and the buildpack started to download and compile cmake. Again, as you may guess, not only did it not work, but I didn’t get any error message to understand why.

So I had to learn how buildpacks work, log into the machine over ssh, and run the buildpack steps by hand. The problem was that the URL the buildpack was trying to download the code from no longer existed.

Perfect: I cloned the buildpack repo, fixed the URL, tested that it actually worked when logged into the machine, and changed my Heroku config to use my fixed version of the buildpack.

However - of course - when I tried to deploy, it didn’t work. It looked like some kind of timeout failure, since the error always happened at a different point of the deploy. I changed some of Heroku’s obscure configuration using some labs features and the COMPILE_TIMEOUT variable, and it didn’t work. I also tried increasing my ssh timeout configuration…

So I realized… I can compile outside Heroku and copy the binaries over; I also have a 64-bit Linux server! At first it seemed to work, but it just kept needing more files, until I got to libc.so.6. The problem is that Heroku ships version 2.11 and I have 2.14… After asking around, that’s because they run Ubuntu 10.10!!

By the way, having both versions of the library coexist makes even ls segfault.

OK, I’m not deploying it on Heroku, but luckily I managed to get something good out of this: I made a script that gets you all the dependencies of a given set of .so files. It works recursively, so it finds the dependency closure. Take a look:
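The idea goes roughly like this (a sketch rather than the exact script; ldd does the heavy lifting):

import subprocess
import sys

def shared_lib_closure(paths):
    """Transitive closure of the shared-library dependencies of the
    given .so files, resolved through ldd."""
    closure, pending = set(), list(paths)
    while pending:
        lib = pending.pop()
        if lib in closure:
            continue
        closure.add(lib)
        out = subprocess.run(['ldd', lib], capture_output=True, text=True).stdout
        for line in out.splitlines():
            # dependency lines look like "libm.so.6 => /lib/libm.so.6 (0x...)"
            if '=>' in line:
                target = line.split('=>')[1].split('(')[0].strip()
                if target and target != 'not found':
                    pending.append(target)
    return closure - set(paths)

if __name__ == '__main__':
    for dep in sorted(shared_lib_closure(sys.argv[1:])):
        print(dep)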

Perceptual basis of evolving Western musical styles

Music evolves through history in recognizable form, driven in part by technical and societal changes, but largely following its own dynamics. These changes lead to characteristic usages of different musical features.

The question that we asked ourselves was: can we actually measure that?

We used the Peachnote corpus to find patterns in melodic interval statistics, and found that these statistics almost categorically cluster the years into the Baroque, Classical, Romantic and post-Romantic periods.

Once we found evidence that those statistics are useful for recognizing styles, we applied a pattern recognition technique to try to understand the clusters. The results were surprising: the patterns were musically and cognitively interpretable.

The cognitive interpretation follows from the fact that when we listen to music, we continuously predict what is going to sound next. Those predictions must somehow agree with the statistics of a massive corpus; otherwise we would constantly expect unlikely events.

Do you want to get into the details? Read the full article!

Perceptual basis of evolving Western musical styles, by Pablo H. Rodriguez Zivic, Favio Shifres, and Guillermo A. Cecchi.

In the news

The paper appeared in several media outlets. The following video was taken from the Royal Institution of Australia (the article is reviewed at minute 2:03).

The article was also reviewed in the Advancement of Science news, the Italian Scientific American, the University of Buenos Aires news, Engadget, the IBM Research blog, DRadio Wissen, golem.de (a German magazine) and Indonesia Raya News.


Subtitle downloads correlate with the reported number of viewers

A couple of months ago I decided to start watching The Big Bang Theory. I have to admit I had never given it a try before. So, I started watching from the first season’s pilot onwards.

When I finished the second season, I noticed something unusual in the number of downloads of the subtitles for the first episode of season three: it was twice as large as for the previous episode!

Thus, I wondered whether it was related to the actual Nielsen-like rating of the series.

In order to answer that, I made two quick-and-dirty scrapers. One is for tvsubtitles.net, whose code can be downloaded from here. The other was used to get the actual number of viewers; for that, the fan page the-big-bang-theory.com was of great help, and its scraper is here.

Once I got the data, it was time to plot it. The following figure depicts four curves. The first three are scaled versions of the number of downloads of the English, Spanish and Russian subtitles. The last one is a scaled version of the actual number of viewers.

[figure: scaled subtitle downloads (English, Spanish, Russian) and scaled number of viewers, per episode]

The black dashed line marks the end of the fifth season. It can be seen that from that point on the trends behave very differently. This may be due to a lag between the number of viewers of a TV series and the number of downloads of its subtitles.

One can tell that the curves correlate just by looking at them. Nevertheless, I ran a Spearman correlation and a goodness-of-fit test, resulting in a significant correlation between the downloads of English subtitles and the rating of the show: \(\rho = 0.64\), p-value < \(3.6\times10^{-15}\).
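For reference, the test itself is a one-liner with scipy (the input arrays are placeholders for the scraped per-episode data):

import numpy as np
from scipy.stats import spearmanr

downloads = np.loadtxt('english_subtitle_downloads.txt')  # hypothetical dump
viewers = np.loadtxt('viewers_per_episode.txt')           # from the scrapers

rho, p_value = spearmanr(downloads, viewers)
print('rho = %.2f, p = %.2g' % (rho, p_value))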

The questions that now arise are: who are we measuring by counting the number of downloads of English subtitles? And why does it correlate with the number of viewers?

But there’s more! After checking that there is an actual correlation, I made some scatter plots to see what the distribution looks like. To my surprise, the dots follow a specific trajectory in the space of number of viewers vs. number of downloads, as depicted in the following figures.

[figures: scatter plots of number of viewers vs. number of downloads]

I hope you find it as intriguing as I do!

Mutual information

I have the recurring flaw of programming things and only later realizing that there was a library that already computed them for me.

This may be one of those cases but, well, it’s too late now, because it’s already programmed.

I needed to compute the mutual information between continuous variables and a discrete variable. After searching for a while, I decided it would be faster to program it myself.

I’ll leave the code here in case the same thing ever happens to you. What follows is a sketch of the idea (it discretizes the continuous variable with a histogram and applies the definition directly; names and bin count are illustrative):
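import numpy as np

def mutual_information(x, y, bins=10):
    """Mutual information (in nats) between a continuous x and a discrete
    y, where y is assumed to take small non-negative integer values."""
    # discretize the continuous variable into histogram bins
    x = np.digitize(x, np.histogram(x, bins=bins)[1][:-1])
    joint = np.zeros((x.max() + 1, int(y.max()) + 1))
    for xi, yi in zip(x, y):
        joint[xi, int(yi)] += 1
    joint /= joint.sum()                   # p(x, y)
    px = joint.sum(axis=1, keepdims=True)  # p(x)
    py = joint.sum(axis=0, keepdims=True)  # p(y)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())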

To test that it works correctly, we can check two things.

1) Check that a variable that depends on another has high mutual information with it, and that two independent variables have low mutual information.

To check this we can define three random variables (a quick numeric test follows the definitions):

\( X \sim Normal(0,1) \)

\( Y_1 = 1 \Leftrightarrow X < 0 \), \( Y_1 = 0 \) otherwise, and

\( Y_2 \sim Binomial(1,\frac{1}{2}) \)
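A quick numeric check of the first property, using the sketch above (sample sizes are arbitrary):

rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y1 = (x < 0).astype(int)               # fully determined by x
y2 = rng.binomial(1, 0.5, size=10000)  # independent of x

print(mutual_information(x, y1))  # clearly positive
print(mutual_information(x, y2))  # close to 0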

2) Check that the following property holds:

\[I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\]

In this case, the function that computes the mutual information is not strictly symmetric, because the vector x contains real values but the vector y does not. That means x has to be discretized in one of the cases:
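A sketch of that check, discretizing x by hand so the two arguments can be swapped:

x_discrete = np.digitize(x, np.histogram(x, bins=10)[1][:-1])
print(mutual_information(x, y1))                         # I(X; Y1)
print(mutual_information(y1.astype(float), x_discrete))  # I(Y1; X)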

This code prints the same value twice (up to floating point), which is exactly the symmetry the property demands.

Is your email address too long? Part II

After the last post was published, @marcelorinesi asked me whether the distribution of email address lengths might be related to the length of people’s names.

A while ago I had access to a database of names from Argentina, so I made a quick comparison.

On one hand, I computed the distribution of first-name and last-name lengths. Again: if your first/last name is between 4 and 9 letters long, you are a very normal person.

I also computed the distribution of the sum of these two variables using the law of total probability (and assuming the two lengths are independent):

\[ P(|nom| + |ap| = s) = \sum_{i=1}^{s-1} P(|nom| = i)\times P(|ap|=s-i) \]

where \(|nom|\) and \(|ap|\) stand for the length of the first name and the last name, respectively.
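In code, that sum is exactly a discrete convolution (a sketch; the arrays are placeholders for the measured length distributions, with index i holding P(length = i)):

import numpy as np

def sum_length_distribution(p_first, p_last):
    """Distribution of |nom| + |ap| as the convolution of the two."""
    return np.convolve(p_first, p_last)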

So, what can be seen in the plot above?

Well, first, that the length of email addresses has much more variability than that of first and last names. Still, email lengths fall between the first/last name lengths and the first + last name lengths, which suggests that using initials in email addresses is common, as in my university address prodriguez@dc.uba.ar.

Of course, there are also addresses like estacasillanoexiste@gmail.com that have nothing to do with names; however, that seems not to be the usual case, since only 25% of the mailboxes contain no first or last name.

Is your email address too long?

It was always right in front of me, but I never paid attention to it.

For a while now I’ve had a medium-sized database of email addresses: every address I ever wrote to or that ever wrote to me!!

Although that’s not that many, 15k addresses is a large enough number to play around with. So I asked myself the question: is my email address too long?

Very easy to measure: let’s measure the length of the 15k addresses I have!
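Something along these lines does the job (a sketch; the file name is a placeholder for wherever the addresses live):

import matplotlib.pyplot as plt

with open('addresses.txt') as f:  # one address per line
    lengths = [len(line.strip()) for line in f]

plt.hist(lengths, bins=range(1, max(lengths) + 2))
plt.xlabel('address length (characters)')
plt.ylabel('number of addresses')
plt.show()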

The result shows two things. First, there seems to be a length people prefer: 22 characters (including the @blabla.com part).

Second, there are some extremely strange peaks very, very far out! Those addresses are of the kind Twitter uses to notify you about things.

So, if your email address is between 20 and 25 characters long, you are a very normal person.

Algorithmic composition workshop at Ekoparty

As strange as it may seem, I was at Ekoparty - a computer security conference - giving a workshop on algorithmic music composition.

If you want to recreate the workshop experience at home, follow these steps:

1) Install these packages:

sudo apt-get install python-setuptools graphviz graphviz-dev libgraphviz-dev python-matplotlib timidity

sudo easy_install ipython==0.13
sudo easy_install ipdb==0.7
sudo easy_install pygraphviz==1.1

2) Download https://dl.dropbox.com/u/11150175/taller_musica.tgz

3) Unpack it (tar xvf taller_musica.tgz)

4) cd lib/midistuff; sudo python setup.py install

5) Go through the slides

6) Start programming!