A tiny piece of powerful Python

    admin

    When working with IATI data I often use Python, a programming language which is very suitable for handling a lot of data. Within a programming language there are libraries: collections of functions with a specific purpose. One library that is very powerful when working with large amounts of data is called Pandas. In Pandas you can store data in large tables, called dataframes. Then Pandas offers a set of tools to perform operations on your tables.

    Here is why I needed to use Pandas. I wanted to extract all the organization names in IATI (to do the analysis as described here and here). One of the problems was that sometimes a single field, say the “provider-org” field, contains multiple values, as you can see in this piece of IATI data:

    To handle this, I added a column to the table for every narrative line. But it turned out one publisher added 63 (!) narrative lines to a single field. The final table looked something like this, going on all the way to narrative_63:

    But to count the number of unique organizations (as described in this and this post) it was more useful to have a single ‘narrative’ column. It was, however, important to keep the data in the first few (non narrative) columns for all these narrative values. The transformation I wanted is shown in the following picture.

    To do this transformation, from the original table we want to:

    1. select the green and blue block and copy them. 
    2. select the green and orange block, copy those too. 
    3. do this for all other narrative columns (green block + narrative column)
    4. paste them on top of eachother. 

    This is where the piece of code comes in, since Pandas has some awesome tool that we can use for this. For example, calling df.columns gives you a list of all the column names in your dataframe (called df). We can use this to make a list of all the columns that have the word ‘narrative’ in the name (nar_cols, the blue and orange columns). 

    Normally you would do this using a combination of a for-loop and an if-statement, like so: 

    For the column names in the list of all the column names in your dataframe; if the column name contains the word “narrative”; add it to the list of narrative columns (nar_cols). Or in code:

    But in Python you can do this in one line using something called a list comprehension:

    If you take a good look you can see this code contains the same elements, it just does all four steps in one line of code.

    Now the first step in our transformation recipe is to select the green and blue block and copy them. In Pandas you can select certain columns from your dataframe using square brackets like this: df[list_of_columns_you_want_to_select]. Now to select the green block, we want to select all columns, except the narrative columns (nar_cols). To do this we use df.columns again, but we remove (or drop) the narrative columns. So the green block would be:  df[df.columns.drop(nar_cols)]

    But we don’t just want the green block, we want to combine it with the blue block, and later with the orange block and so on, for all the narrative columns. Luckily we already made a list of the narrative columns (nar_cols) and earlier we saw that we can use a for-loop to, one by one, get every value in a list. So by saying for col in nar_cols we first get narrative_1 (the blue block), then narrative_2 (the orange block) etc. Now we can change the selection we made earlier to add the narrative columns, one by one. So we say: df[df.columns.drop(nar_cols).tolist() + [col]] for col in nar_cols

    Now the old narrative columns had names like: “narrative_X” where X was a number. But since we are going to make one narrative column we no longer need the numbers so we rename the column, and then we store everything in a list called nar_dfs. 

    All this Pandas magic, combined with another list comprehension looks like this in code (the red \ is just to break one line of code onto multiple lines):

    The final step is to paste all these blocks on top of eachother. For this we can use the Pandas concat function and save the new dataframe calling it df2.

    You could actually put the concat function around the 2nd line, but although you want your code to be condense, it is also important that you and other people can still read it. 

    So there you have it. Three lines of code to convert a very wide table to a very long table. 

    I hope you could follow along and I hope I could convey to you the elegance of a piece of code like this. It always makes me very happy, so I wanted to share it with you. 

    If you have any questions about this code, or if you would like me to explain something else, please let me know in the comments below or get in touch.

    Leave a comment

    Your email address will not be published.