Brothers in Alms: Bernie & Trump

Have two politicians ever despised one another more than Bernie Sanders and Donald Trump? Not likely. And yet as much as these two mortal combatants despise one another they do share a few things. As of September 2019 of the 3 million donations given to either Sanders or Trump 108 donors gave to both.

This project is really an excuse to try out Google’s BigQuery. It is cloud based SQL where one can store and analyze terabytes of data almost instantaneously and almost for free.

There were 4 steps to this project:
1. Download the Federal Election Data into a CSV spreadsheet and do some rudimentary cleaning there.
2. Further clean the data in R.
3. Mutate and filter the data using Google’s BigQuery SQL.
4. Graph it in Neo4J

Step 1: MAKING A SPREADSHEET OF THE DATA.

Open the FEC file from this page:

https://www.fec.gov/data/committee/C00580100/?cycle=2016&tab=filings

Click on the download csv button.  NOTE the label is one month after the data.  5/1 to 5/30 file is labeled July.

Locate file and rename it June.

Open file in Libre or whatever your favorite spreadsheet program is. The top of the sheet is littered with extraneous detail.  Delete these rows.

There are many more columns than are of any use to us . Open a new spreadsheet and paste the column headings below into it, then copy and paste the relevant columns from the original FEC file into this new spreadsheet.  In this way we preserve the original data file, drop the extraneous , and fix the col names.

I confess I do not recall now why  but I remember the best way to do this was to paste right on top of the column name and then when done, inserting a new blank row, and re- inserting the col names again.

name_lastname_firstaddresscitystatezipamntemployeroccupation

We will be stacking all of our data sets together so it is critical that the columns are always in this order.

Save this sheet as something like Sanders_June_19_Select.csv

Step 2: CLEAN THE DATA IN R
Load tidyverse and Lubridate
library(tidyverse)
library(lubridate)

Next run these steps over the file for each month for each candidate.

Sanders_Oct_Nov_16_Select <- read_csv(“Sanders_Oct_Nov_16_Select.csv”)

View(Sanders_Oct_Nov_16_Select)

Sanders_Oct_Nov_16_Select$zip <-as.integer(gsub(‘[a-zA-Z]’, ”,Sanders_Oct_Nov_16_Select$zip))

Sanders_Oct_Nov_16_Select$amnt <- round(Sanders_Oct_Nov_16_Select$amnt)

Sanders_Oct_Nov_16_Select$date <- as.Date(as.character(Sanders_Oct_Nov_16_Select$date) , “%Y%m%d”)

dim(Sanders_Oct_Nov_16_Select) 

Sanders_Oct_Nov_16_Select <- na.omit(na.omit(Sanders_Oct_Nov_16_Select, cols=c(“name_last”,”address”,”city”,”state”,”zip”,”amnt”,”date”)))

dim(Sanders_Oct_Nov_16_Select)

write_csv(Sanders_Oct_Nov_16_Select , “Sanders_Oct_Nov_16_Select_A.csv”)

 

Step 3: Mutate and filter the data using Google’s BigQuery SQL.

The first thing to do is to upload each file to Google’s cloud Storage. Once every file is uploaded to Google Cloud Storage from within BigQuery create a dataset.

* * *
A frustration saving note about Google’s BigQuery.
Upload your data to Google Cloud Storage and import it into BigQuery from there.
Bewilderingly, the name that Google Cloud Storage assigns needs to be tweaked to run in BigQuery.
This is the name assigned by Google Cloud Storage:
jf11579campaigncontributions:obama.Sanders_2016_through_Sept_2019_2
Note the full colon, change that to a decimal.
And then enclose the entire name between backticks NOT single quote marks.
It needs to look like this:
`jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
* * *

SELECT *
FROM `jf11579campaigncontributions.test.Test_bernie`
UNION ALL
SELECT *
FROM `jf11579campaigncontributions.test.Test_trump`

Once all of the files are imported into BigQuery the next thing to do is to join them all into a single table. I chose to separately join all of the Trump files together into one file and the Sanders files together in a separate file. Later I joined these two together.

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Jan_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Feb_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_March_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_April_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_May_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_June_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_July_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Aug_16_Select`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Sept_16_Select`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Oct_16_Select`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Oct_Nov_16_Select`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Jan_Feb_March_19_C`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_April_May_June_C`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_July_Aug_Sept_19_Select_C`

Let us take a look at the table. The Trump file has 1.3 million rows

SELECT *
FROM
`jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`
Limit 3

name_lastname_firstaddresscitystatezipamntemployeroccupationdatecandidate
CarsonRichardPOB 123APOAA34022300U.S. DEPARTMENT OF STATEMANAGEMENT OFFICER2016-03-03Trump
CarsonRichardPOB 123APOAA3402249U.S. DEPARTMENT OF STATEMANAGEMENT OFFICER2016-03-10Trump
CoolDwaynePOB 456FPOAE964550UNIVERSITY OF TENNESSEEUSN ELECTRONICS TECHNICIAN2016-03-17Trump

The Sanders donation data has 2million rows.
SELECT *
FROM
`jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
Limit 3

Here we join Trump and Sanders
SELECT *
FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
UNION ALL
SELECT *
FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

But before we get into the weeds with actual data let us have a run through with a practice data set first.

Here is a practice Sanders test data

name_firstname_lastaddresszipstateamountcandidate
annadams1 A11111NY5bernie
agathaboxer2 B11111NY10bernie
attildacarlson3 C11222CT15bernie
annabelledavidson4 d12345NJ15bernie
alexaericson5 fifith12566NJ10bernie
ameliafrazer6 F11111NJ10bernie
abigailgrant7 seventy11510NJ5bernie
billgunter1 first11111NY20bernie
amyharris8 H11545CT10bernie
benharris2 second21212NJ25bernie
athenaizzo9 I11579NY15bernie
angelajames10J11579NY20bernie

And here is the Trump test data

name_firstname_lastaddresszipstateamountcandidate
ballontineadams2 B55555AL20trump
bainboxer4 A66666WY5trump
brianfrazer5 fifth11111BY10trump
babsongomez1 A11111NY5trump
billgunter1 first11111NY20trump
benharris2 second21212NJ25trump
bakerhill3 B11579NY10trump
bobinez3 third31313VT30trump
athenaizzo9 I11579NY15trump
angelajames10J11579NY20trump
bradjames4 fourth11222CT5trump

Again, upload them to Google cloud then move them into BigQuery by creating a datasets. Once that is done join the two datasets.
SELECT *
FROM `jf11579campaigncontributions.test.Test_bernie`
UNION ALL
SELECT *
FROM `jf11579campaigncontributions.test.Test_trump`

Here is the combined Sanders and Trump test data.

name_firstname_lastaddresszipstateamountcandidate
attildacarlson3 C11222CT15bernie
amyharris8 H11545CT10bernie
annabelledavidson4 d12345NJ15bernie
alexaericson5 fifith12566NJ10bernie
ameliafrazer6 F11111NJ10bernie
abigailgrant7 seventy11510NJ5bernie
benharris2 second21212NJ25bernie
annadams1 A11111NY5bernie
agathaboxer2 B11111NY10bernie
athenaizzo9 I11579NY15bernie
angelajames10J11579NY20bernie
billgunter1 first11111NY20bernie
ballontineadams2 B55555AL20trump
bakerhill3 B11579NY10trump
athenaizzo9 I11579NY15trump
angelajames10J11579NY20trump
bobinez3 third31313VT30trump
billgunter1 first11111NY20trump
brianfrazer5 fifth11111BY10trump
babsongomez1 A11111NY5trump
bainboxer4 A66666WY5trump
bradjames4 fourth11222CT5trump
benharris2 second21212NJ25trump

Here is a not small difference between SQL and BigQuery’s version of SQL. Big Query only offers –as of Jan 2020–a single version of filtering by intersection: INTERSECT DISTINCT. It is a combined GROUP BY and INNER JOIN.

Let us look for the same last name that appear in both Sander’s and Trump’s list of donors:

SELECT name_last
FROM `jf11579campaigncontributions.test.Test_bernie`
INTERSECT DISTINCT
SELECT name_last
FROM `jf11579campaigncontributions.test.Test_trump`

This returns 7 names:

 

A quick glance at the Sanders Trump joined table shows that there are more than one harris and frazer so which harris and frazer contributed to both candidates?

Let us add more criteria, name_first:
SELECT name_last, name_first
FROM `jf11579campaigncontributions.test.Test_bernie`
INTERSECT DISTINCT
SELECT name_last, name_first
FROM `jf11579campaigncontributions.test.Test_trump`

Our results drop from 7 to 4.

And if we add in address our results are unchanged.

And if we add in address our results are unchanged.

SELECT name_last, name_first,address
FROM `jf11579campaigncontributions.test.Test_bernie`
INTERSECT DISTINCT
SELECT name_last, name_first, address
FROM `jf11579campaigncontributions.test.Test_trump`

Now back to our real world data.
MOST unfortunately our real world data do not work out so simply.

Let us finally compare the two tables:
SELECT name_last
FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
INTERSECT DISTINCT
SELECT name_last
FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

This returns 1991 names. But this is not meaningful. We are only looking at names in common so far, not people in common. Let us add name_first to the criteria. This drops us down to 551 rows.

SELECT name_last , name_first
FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
INTERSECT DISTINCT
SELECT name_last, name_first
FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

Heartbreakingly , if we add address we get 0 results.
SELECT name_last , name_first, address
FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
INTERSECT DISTINCT
SELECT name_last, name_first, address
FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

Interestingly, if we search only on matching addresses we get 11. Apparently there are 11 households where different people residing under the same roof where one donated for Sanders while the other donated for Trump. Meals must be quiet there.

SELECT address
FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
INTERSECT DISTINCT
SELECT address
FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

But back to our question. If we search for matches on name_last and name_first we get 551 matches. If we add zip it drops to 3.
If we search name_last and name_first and state we get back 108

It would be nice if we could match on all eight criteria (name_last, name_first, address, city, state, zip, employer) but we cannot . Or maybe we can and the truth of the matter is that there are zero people who gave to both Trump and Sanders but out of 3 million donors I am guessing this is not the case.

If you have never donated online to a candidate, which is where this data comes from , one has to enter quite a bit of detail about yourself including employer and occupation. People are entering this themselves so the data should be accurate. It is my guess that people would prefer to give anonymously and resent having to disclose this much detail about themselves and so somewhat sabotage the data but at this point , this is only a guess.

Another possibility may be BigQuery’s limited JOIN capability. INTERSECT DISTINCT which is an inner join + group by might be giving us lower results than if we had done this in SQL. In any event next we will graph it using Neo4J

Step 4. GRAPH IN NEO4J

Step one clear out any old data.
MATCH (n)
DETACH DELETE n

Step two load the dat and create  the nodes.  
LOAD CSV WITH HEADERS FROM ‘file:///Bernie_Trump_108_Indexed.csv’ AS csvDATA
MERGE (candidateTrump:CandidateTrump{candidate:”trump”})
MERGE (candidateSanders:CandidateSanders{candidate:”sanders”})
MERGE (donor:Donor {LastName: csvDATA.name_last, FirstName: csvDATA.name_first, FullName: csvDATA.full_name, State: csvDATA.state})

FOREACH (_ IN case when csvDATA.candidate = ‘trump’ then [1] else [] end|
CREATE (donor)-[donorRel:DONATED]->(candidateTrump)
)
FOREACH (_ IN case when csvDATA.candidate = ‘sanders’ then [1] else [] end|
CREATE (donor)-[donorRel:DONATED]->(candidateSanders)
)

Step three graph.

MATCH p = (d:Donor)-[r:DONATED]->()
RETURN p