Analyzing Social Media: More Than Words Can Say

This post is a work in progress. First of all, this is an interesting story (and an even more interesting dataset).
I was going through the materials to showcase in this blog and came across a presentation my ESCP classmates and I did on Microsoft. It has stock price graphs, revenue indicators, the product portfolio and many other pieces of information that we scraped from multiple public sources.
At first I thought about expanding on that and looking at the behavior of Microsoft stock more closely, but then, almost immediately, I had an idea: what about the people? Are people working for Microsoft any different from employees of other companies? And if they are, in what way? This ultimately led me to search for data on people’s social media accounts and to this study (especially since it analyzes text rather than numerical values, as in previous posts).

Not long after, I came across this dataset, which covers nearly 160 000 social media accounts. It comes as a MySQL dump, and this tutorial is quite helpful for converting it to CSV with Python.
Looking at what we have, I found that the available tables are “multiple blogger”, “multiple facebook”, “multiple flickr”, “multiple google”, “multiple lastfm”, “multiple livejournal”, “multiple myspace”, “multiple picasaweb”, “multiple twitter” and “multiple youtube”. So, basically, every social network out there!

Out of these tables I chose Google, YouTube, Facebook, Twitter and Blogger. The great thing about this dataset is that it’s versatile and contains a lot of data, but for this study the more superficial data points such as music, books, hobbies, films, emails and links, which are specific to each person, are omitted.

The fields interesting for the analysis are:

  • Age

  • Gender

  • Location (here - country, hometown, any other location indicators)

  • N of connections

  • Occupation (job)

  • Industry

  • Company

  • School / university

  • [possibly] last time online, as a proxy for online activity

  • The social media the data is taken from

So we promptly reduce the dataset:
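A minimal sketch of what such a reduction could look like in R, on a toy table. The raw column names (gender, location, music, email) and the object names facebook_raw and keep_cols are hypothetical and only illustrate the idea of keeping the fields listed above:

```r
# Toy stand-in for one raw table; the column names here are hypothetical.
facebook_raw <- data.frame(
  gid      = c("user_a", "user_b"),
  gender   = c("Male", NA),
  location = c("San Diego California", NA),
  music    = c("jazz", "rock"),       # person-specific, dropped below
  email    = c("a@example.com", NA),  # person-specific, dropped below
  stringsAsFactors = FALSE
)

# Fields we want to keep for the analysis
keep_cols <- c("gid", "gender", "location")

# intersect() guards against a table that lacks one of the wanted columns
facebook_small <- facebook_raw[, intersect(keep_cols, names(facebook_raw)), drop = FALSE]

print(names(facebook_small))
```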
                                   
Interestingly, my first assumption was that the records are completely unrelated across social networks.
However, after combining them into one table and counting unique gids, I found that out of 120 000 rows only 35 000 are unique. This means many users are active on several networks, which is quite common.
So it is better to gather all the information about one user in one row.
Writing a script for a “long” dataset and displaying the results by group, we get:
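One way to get one row per user, sketched on toy data. This is an assumption about the approach, not the actual script: each network table gets its columns suffixed with the network name and the tables are then joined on gid. The helper add_suffix and the toy tables are made up for illustration:

```r
# Toy per-network tables that share the gid key (values are invented)
facebook <- data.frame(gid = c("u1", "u2"), location = c("San Diego", "Pune"),
                       stringsAsFactors = FALSE)
twitter  <- data.frame(gid = c("u2", "u3"), location = c("pune", "shanghai"),
                       stringsAsFactors = FALSE)
youtube  <- data.frame(gid = c("u1", "u3"), age = c(23L, 31L),
                       stringsAsFactors = FALSE)

# Hypothetical helper: append the network name to every column except gid
add_suffix <- function(df, network) {
  other <- names(df) != "gid"
  names(df)[other] <- paste(names(df)[other], network, sep = "_")
  df
}

tables <- list(facebook = facebook, twitter = twitter, youtube = youtube)
tagged <- Map(add_suffix, tables, names(tables))

# Full outer join on gid, so users missing from a network get NA there
wide <- Reduce(function(x, y) merge(x, y, by = "gid", all = TRUE), tagged)

# Rows in the stacked data versus distinct users
all_gids <- c(facebook$gid, twitter$gid, youtube$gid)
n_rows_stacked <- length(all_gids)
n_unique_users <- length(unique(all_gids))

print(wide)
```

With all = TRUE nobody is dropped, which matters here because most users appear on only some of the networks.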

  • 5 location variables (all 5 media)

  • 3 occupation/job variables

  • 4 companies variables and 1 industry one (all except blogger)

  • 3 number of connections variables

  • 3 schools/university variables (google, facebook and youtube)

  • 2 gender variables (blogger and facebook) and 2 age variables (facebook and youtube)

There is also last activity online.

The data
First thoughts on what to do with data:

  • Join the variables in one category together, delete replicating ones

  • Introduce one separator between words in one field

  • Delete signs like “?”, “!”, “@”

  • Average/max number of connections across all networks

  • Unclear how to account for last time online (“3 hours ago” or a specific date does not say much on its own)

  • Most common locations (possibly a map), schools, companies, professions

  • Industry is only indicated for 300 people so can probably be removed
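A rough sketch of the text-cleaning ideas from this list, in base R. The function name clean_text and the sample values are made up, and lowercasing is an extra assumption added so that variants like “Shanghai” and “shanghai” are counted together:

```r
# Hypothetical helper: drop "?", "!" and "@", use one space as the only
# separator, trim the ends, and lowercase (lowercasing is an assumption)
clean_text <- function(x) {
  x <- gsub("[?!@]", " ", x)         # delete the unwanted signs
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  tolower(trimws(x))                 # NA values stay NA throughout
}

# Invented sample in the spirit of the location fields
locations <- c("San Diego  CA", "san diego ca", "shanghai", NA,
               "Shanghai!", "pune india")
cleaned <- clean_text(locations)

# Most common values; table() ignores NA by default
top_locations <- sort(table(cleaned), decreasing = TRUE)
print(head(top_locations, 3))
```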

Let’s take a look at the data:

str(DATASET_SOCIAL_MEDIA_2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    35265 obs. of  27 variables:
$ gid                   : chr "001.deepak" "02chan.com" "03devildog" "0404wyt" ...
$ gender_facebook       : chr NA NA "Male" NA ...
$ location_facebook     : chr NA NA "San Diego California" NA ...
$ schools_facebook      : chr NA NA "San Diego State University \\'99  Palomar \\'73 San" NA ...
$ companies_facebook    : chr NA NA "Caliburnus Enterprises" NA ...
$ age_group_facebook    : chr NA NA NA NA ...
$ n_connections_facebook: int  NA NA 1208 NA NA 509 NA NA 689 NA ...
$ occupation_google     : chr "student" "advertiser" "business &amp   internet services &amp consultin" "student" ...
$ companies_google      : chr NA NA "mcrd museum historical society   decor &amp styl" NA ...
$ schools_google        : chr "ryan international school  delhi" NA "san diego state university (mba - 1999)   uc san d" NA ...
$ organization_google   : chr NA NA "caliburnus enterprises" NA ...
$ location_google       : chr "pune india" "shanghai" "san diego  california" "&#38271 &#27801" ...
$ age_youtube           : int 23 27 NA 31 NA 24 NA NA 21 31 ...
$ location_youtube      : chr "India" "China" NA "United States" ...
$ last_active_youtube   : chr "Jul 22 2011" "Jul 25  2011" NA "8 months ago" ...
$ occupation_youtube    : chr NA NA NA NA ...
$ companies_youtube     : chr NA NA NA NA ...
$ schools_youtube       : chr NA NA NA NA ...
$ connections_youtube   : int 0 0 NA 0 8 2 0 NA 17 190 ...
$ location_twitter      : chr NA "shanghai" "San Diego  CA" "china" ...
$ last_active_twitter   : chr NA "Thu Jul 09 07:44:07 +0000 2009" "Sat Mar 13 00:56:05 +0000 2010" "Sat Sep 24 08:24:08 +0000 2011" ...
$ n_connections_twitter : int  NA 31 37 119 234 NA 163 27 NA 76 ...
$ gender_blogger        : chr NA NA NA NA ...
$ location_blogger      : chr NA NA NA NA ...
$ industry_blogger      : chr NA NA NA NA ...
$ occupation_blogger    : chr NA NA NA NA ...
$ connections_total     : int NA NA NA NA NA NA NA NA NA NA ...

Total connections across the networks:

# total number of connections; each line ends with "+" so that R reads the sum as one expression
DATASET_SOCIAL_MEDIA_2$connections_total <- DATASET_SOCIAL_MEDIA_2$n_connections_facebook +
  DATASET_SOCIAL_MEDIA_2$n_connections_twitter +
  DATASET_SOCIAL_MEDIA_2$connections_youtube
> summary(DATASET_SOCIAL_MEDIA_2$connections_total)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      5     373     744    2804    1631 6148860   26840
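The large number of NA's is expected: in R, adding NA to a number gives NA, so a plain sum is missing for anyone absent from even one network. A possible alternative, sketched on toy values with the same column names, is to sum whatever is available and keep NA only when all three counts are missing:

```r
# Toy values; the column names mirror the ones in the dataset
connections <- data.frame(
  n_connections_facebook = c(NA, 1208L, 509L, NA),
  n_connections_twitter  = c(31L, 37L, NA, NA),
  connections_youtube    = c(0L, NA, 2L, NA)
)

conn_cols <- c("n_connections_facebook", "n_connections_twitter",
               "connections_youtube")

# Sum the available counts, ignoring missing ones
total <- rowSums(connections[, conn_cols], na.rm = TRUE)

# rowSums() returns 0 when every value is NA; restore NA for those rows
all_missing <- rowSums(!is.na(connections[, conn_cols])) == 0
total[all_missing] <- NA

connections$connections_total <- unname(total)
print(connections$connections_total)
```

Whether a partial sum is the right measure is a judgment call, since it treats a missing profile as zero connections.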

Most people have about 400 to 600 connections on Facebook:

hist(DATASET_SOCIAL_MEDIA_2$n_connections_facebook, breaks=100)


A curious thing is immediately clear: among the profiles that state a gender, there are far more men than women. About 77% of the filled-in values on Blogger and about 86% on Facebook are male.

> length(which(DATASET_SOCIAL_MEDIA_2$gender_blogger=="Female"))
[1] 463
> length(which(DATASET_SOCIAL_MEDIA_2$gender_blogger=="Male"))
[1] 1582
> length(which(DATASET_SOCIAL_MEDIA_2$gender_facebook=="Female"))
[1] 2542
> length(which(DATASET_SOCIAL_MEDIA_2$gender_facebook=="Male"))
[1] 16139
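The same comparison can be expressed as percentages with table() and prop.table(). The sketch below rebuilds toy vectors from the four counts above rather than reading the real columns; gender_share is a made-up helper name:

```r
# Hypothetical helper: percentage of each stated gender, ignoring NA
gender_share <- function(x) {
  counts <- table(x, useNA = "no")
  round(100 * prop.table(counts), 1)
}

# Toy vectors rebuilt from the counts printed above
gender_blogger  <- rep(c("Female", "Male"), times = c(463, 1582))
gender_facebook <- rep(c("Female", "Male"), times = c(2542, 16139))

blogger_share  <- gender_share(gender_blogger)
facebook_share <- gender_share(gender_facebook)

print(blogger_share)
print(facebook_share)
```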

Then we unite everything in one column, for example:

schools_cols <- c('schools_facebook', 'schools_google', 'schools_youtube')
DATASET_SOCIAL_MEDIA_2$schools_all <- apply(DATASET_SOCIAL_MEDIA_2[, schools_cols], 1, paste, collapse = " ")

Replacing the NA values (the word boundaries \\b keep names that merely contain “NA”, such as NASA, intact):

DATASET_SOCIAL_MEDIA_2$schools_all <- trimws(gsub('\\bNA\\b', '', DATASET_SOCIAL_MEDIA_2$schools_all))
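Pasting the columns turns missing values into literal "NA" strings, and removing them afterwards with a pattern can also hit text that contains those letters. A possible alternative, sketched here on toy data, is to drop the missing values before pasting. The helper combine_fields is hypothetical and assumes all the columns are character:

```r
# Hypothetical helper: join the non-missing values of several character
# columns into one string per row; rows with no values at all stay NA
combine_fields <- function(df, cols, sep = " ") {
  apply(df[, cols, drop = FALSE], 1, function(row) {
    parts <- row[!is.na(row)]
    if (length(parts) == 0) NA_character_ else paste(parts, collapse = sep)
  })
}

# Toy rows; the column names mirror the ones in the dataset
schools <- data.frame(
  schools_facebook = c("NASA Academy", NA, NA),
  schools_google   = c(NA, "ryan international school delhi", NA),
  schools_youtube  = c("San Diego State University", NA, NA),
  stringsAsFactors = FALSE
)

schools_cols <- c("schools_facebook", "schools_google", "schools_youtube")
schools$schools_all <- unname(combine_fields(schools, schools_cols))
print(schools$schools_all)
```

Keeping a real NA for users without any school makes it easy to count how many profiles actually carry this information.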


The full code for this task (ongoing) can be found on my GitHub.

Written on April 10, 2018