Brothers in Alms: Bernie & Trump

Have two politicians ever despised one another more than Bernie Sanders and Donald Trump? Not likely. And yet as much as these two mortal combatants despise one another they do share a few things. As of September 2019 of the 3 million donations given to either Sanders or Trump 108 donors gave to both.

This project is really an excuse to try out Google’s BigQuery. It is cloud based SQL where one can store and analyze terabytes of data almost instantaneously and almost for free.

There were 4 steps to this project:
1. Download the Federal Election Data into a CSV spreadsheet and do some rudimentary cleaning there.
2. Further clean the data in R.
3. Mutate and filter the data using Google’s BigQuery SQL.
4. Graph it in Neo4J

Step 1: MAKING A SPREADSHEET OF THE DATA.

Open the FEC file from this page:

https://www.fec.gov/data/committee/C00580100/?cycle=2016&tab=filings

Click on the download csv button. NOTE the label is one month after the data. 5/1 to 5/30 file is labeled July.

Locate file and rename it June.

Open file in Libre or whatever your favorite spreadsheet program is. The top of the sheet is littered with extraneous detail. Delete these rows.

There are many more columns than are of any use to us . Open a new spreadsheet and paste the column headings below into it, then copy and paste the relevant columns from the original FEC file into this new spreadsheet. In this way we preserve the original data file, drop the extraneous , and fix the col names.

I confess I do not recall now why but I remember the best way to do this was to paste right on top of the column name and then when done, inserting a new blank row, and re- inserting the col names again.

name_last	name_first	address	city	state	zip	amnt	employer	occupation

We will be stacking all of our data sets together so it is critical that the columns are always in this order.

Save this sheet as something like Sanders_June_19_Select.csv

Step 2: CLEAN THE DATA IN R
Load tidyverse and Lubridate
library(tidyverse)
library(lubridate)

Next run these steps over the file for each month for each candidate.

Sanders_Oct_Nov_16_Select <- read_csv(“Sanders_Oct_Nov_16_Select.csv”)

View(Sanders_Oct_Nov_16_Select)

Sanders_Oct_Nov_16_Select$zip <-as.integer(gsub(‘[a-zA-Z]’, ”,Sanders_Oct_Nov_16_Select$zip))

Sanders_Oct_Nov_16_Select$amnt <- round(Sanders_Oct_Nov_16_Select$amnt)

Sanders_Oct_Nov_16_Select$date <- as.Date(as.character(Sanders_Oct_Nov_16_Select$date) , “%Y%m%d”)

dim(Sanders_Oct_Nov_16_Select)

Sanders_Oct_Nov_16_Select <- na.omit(na.omit(Sanders_Oct_Nov_16_Select, cols=c(“name_last”,”address”,”city”,”state”,”zip”,”amnt”,”date”)))

dim(Sanders_Oct_Nov_16_Select)

write_csv(Sanders_Oct_Nov_16_Select , “Sanders_Oct_Nov_16_Select_A.csv”)

Step 3: Mutate and filter the data using Google’s BigQuery SQL.

The first thing to do is to upload each file to Google’s cloud Storage. Once every file is uploaded to Google Cloud Storage from within BigQuery create a dataset.

* * *
A frustration saving note about Google’s BigQuery.
Upload your data to Google Cloud Storage and import it into BigQuery from there.
Bewilderingly, the name that Google Cloud Storage assigns needs to be tweaked to run in BigQuery.
This is the name assigned by Google Cloud Storage:
jf11579campaigncontributions:obama.Sanders_2016_through_Sept_2019_2
Note the full colon, change that to a decimal.
And then enclose the entire name between backticks NOT single quote marks.
It needs to look like this:
`jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
* * *

SELECT *
FROM `jf11579campaigncontributions.test.Test_bernie`
UNION ALL
SELECT *
FROM `jf11579campaigncontributions.test.Test_trump`

Once all of the files are imported into BigQuery the next thing to do is to join them all into a single table. I chose to separately join all of the Trump files together into one file and the Sanders files together in a separate file. Later I joined these two together.

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Jan_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Feb_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_March_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_April_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_May_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_June_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_July_16_Select_BB`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Aug_16_Select`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Sept_16_Select`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Oct_16_Select`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Oct_Nov_16_Select`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_Jan_Feb_March_19_C`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_April_May_June_C`
UNION ALL

SELECT *
FROM
`jf11579campaigncontributions.obama.Trump_July_Aug_Sept_19_Select_C`

Let us take a look at the table. The Trump file has 1.3 million rows

SELECT *
FROM
`jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`
Limit 3

name_last	name_first	address	city	state	zip	amnt	employer	occupation	date	candidate
Carson	Richard	POB 123	APO	AA	34022	300	U.S. DEPARTMENT OF STATE	MANAGEMENT OFFICER	2016-03-03	Trump
Carson	Richard	POB 123	APO	AA	34022	49	U.S. DEPARTMENT OF STATE	MANAGEMENT OFFICER	2016-03-10	Trump
Cool	Dwayne	POB 456	FPO	AE	9645	50	UNIVERSITY OF TENNESSEE	USN ELECTRONICS TECHNICIAN	2016-03-17	Trump

The Sanders donation data has 2million rows.
SELECT *
FROM
`jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
Limit 3

Here we join Trump and Sanders
SELECT *
FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
UNION ALL
SELECT *
FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

But before we get into the weeds with actual data let us have a run through with a practice data set first.

Here is a practice Sanders test data

name_first	name_last	address	zip	state	amount	candidate
ann	adams	1 A	11111	NY	5	bernie
agatha	boxer	2 B	11111	NY	10	bernie
attilda	carlson	3 C	11222	CT	15	bernie
annabelle	davidson	4 d	12345	NJ	15	bernie
alexa	ericson	5 fifith	12566	NJ	10	bernie
amelia	frazer	6 F	11111	NJ	10	bernie
abigail	grant	7 seventy	11510	NJ	5	bernie
bill	gunter	1 first	11111	NY	20	bernie
amy	harris	8 H	11545	CT	10	bernie
ben	harris	2 second	21212	NJ	25	bernie
athena	izzo	9 I	11579	NY	15	bernie
angela	james	10J	11579	NY	20	bernie

And here is the Trump test data

name_first	name_last	address	zip	state	amount	candidate
ballontine	adams	2 B	55555	AL	20	trump
bain	boxer	4 A	66666	WY	5	trump
brian	frazer	5 fifth	11111	BY	10	trump
babson	gomez	1 A	11111	NY	5	trump
bill	gunter	1 first	11111	NY	20	trump
ben	harris	2 second	21212	NJ	25	trump
baker	hill	3 B	11579	NY	10	trump
bob	inez	3 third	31313	VT	30	trump
athena	izzo	9 I	11579	NY	15	trump
angela	james	10J	11579	NY	20	trump
brad	james	4 fourth	11222	CT	5	trump

Again, upload them to Google cloud then move them into BigQuery by creating a datasets. Once that is done join the two datasets.
SELECT *
FROM `jf11579campaigncontributions.test.Test_bernie`
UNION ALL
SELECT *
FROM `jf11579campaigncontributions.test.Test_trump`

Here is the combined Sanders and Trump test data.

name_first	name_last	address	zip	state	amount	candidate
attilda	carlson	3 C	11222	CT	15	bernie
amy	harris	8 H	11545	CT	10	bernie
annabelle	davidson	4 d	12345	NJ	15	bernie
alexa	ericson	5 fifith	12566	NJ	10	bernie
amelia	frazer	6 F	11111	NJ	10	bernie
abigail	grant	7 seventy	11510	NJ	5	bernie
ben	harris	2 second	21212	NJ	25	bernie
ann	adams	1 A	11111	NY	5	bernie
agatha	boxer	2 B	11111	NY	10	bernie
athena	izzo	9 I	11579	NY	15	bernie
angela	james	10J	11579	NY	20	bernie
bill	gunter	1 first	11111	NY	20	bernie
ballontine	adams	2 B	55555	AL	20	trump
baker	hill	3 B	11579	NY	10	trump
athena	izzo	9 I	11579	NY	15	trump
angela	james	10J	11579	NY	20	trump
bob	inez	3 third	31313	VT	30	trump
bill	gunter	1 first	11111	NY	20	trump
brian	frazer	5 fifth	11111	BY	10	trump
babson	gomez	1 A	11111	NY	5	trump
bain	boxer	4 A	66666	WY	5	trump
brad	james	4 fourth	11222	CT	5	trump
ben	harris	2 second	21212	NJ	25	trump

Here is a not small difference between SQL and BigQuery’s version of SQL. Big Query only offers –as of Jan 2020–a single version of filtering by intersection: INTERSECT DISTINCT. It is a combined GROUP BY and INNER JOIN.

Let us look for the same last name that appear in both Sander’s and Trump’s list of donors:

SELECT name_last
FROM `jf11579campaigncontributions.test.Test_bernie`
INTERSECT DISTINCT
SELECT name_last
FROM `jf11579campaigncontributions.test.Test_trump`

This returns 7 names:

A quick glance at the Sanders Trump joined table shows that there are more than one harris and frazer so which harris and frazer contributed to both candidates?

Let us add more criteria, name_first:
SELECT name_last, name_first
FROM `jf11579campaigncontributions.test.Test_bernie`
INTERSECT DISTINCT
SELECT name_last, name_first
FROM `jf11579campaigncontributions.test.Test_trump`

Our results drop from 7 to 4.

And if we add in address our results are unchanged.

SELECT name_last, name_first,address
FROM `jf11579campaigncontributions.test.Test_bernie`
INTERSECT DISTINCT
SELECT name_last, name_first, address
FROM `jf11579campaigncontributions.test.Test_trump`

Now back to our real world data.
MOST unfortunately our real world data do not work out so simply.

Let us finally compare the two tables:
SELECT name_last
FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
INTERSECT DISTINCT
SELECT name_last
FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

This returns 1991 names. But this is not meaningful. We are only looking at names in common so far, not people in common. Let us add name_first to the criteria. This drops us down to 551 rows.

SELECT name_last , name_first
FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
INTERSECT DISTINCT
SELECT name_last, name_first
FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

Heartbreakingly , if we add address we get 0 results.
SELECT name_last , name_first, address
FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
INTERSECT DISTINCT
SELECT name_last, name_first, address
FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

Interestingly, if we search only on matching addresses we get 11. Apparently there are 11 households where different people residing under the same roof where one donated for Sanders while the other donated for Trump. Meals must be quiet there.

SELECT address
FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`
INTERSECT DISTINCT
SELECT address
FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

But back to our question. If we search for matches on name_last and name_first we get 551 matches. If we add zip it drops to 3.
If we search name_last and name_first and state we get back 108

It would be nice if we could match on all eight criteria (name_last, name_first, address, city, state, zip, employer) but we cannot . Or maybe we can and the truth of the matter is that there are zero people who gave to both Trump and Sanders but out of 3 million donors I am guessing this is not the case.

If you have never donated online to a candidate, which is where this data comes from , one has to enter quite a bit of detail about yourself including employer and occupation. People are entering this themselves so the data should be accurate. It is my guess that people would prefer to give anonymously and resent having to disclose this much detail about themselves and so somewhat sabotage the data but at this point , this is only a guess.

Another possibility may be BigQuery’s limited JOIN capability. INTERSECT DISTINCT which is an inner join + group by might be giving us lower results than if we had done this in SQL. In any event next we will graph it using Neo4J

Step 4. GRAPH IN NEO4J

Step one clear out any old data.
MATCH (n)
DETACH DELETE n

Step two load the dat and create the nodes.
LOAD CSV WITH HEADERS FROM ‘file:///Bernie_Trump_108_Indexed.csv’ AS csvDATA
MERGE (candidateTrump:CandidateTrump{candidate:”trump”})
MERGE (candidateSanders:CandidateSanders{candidate:”sanders”})
MERGE (donor:Donor {LastName: csvDATA.name_last, FirstName: csvDATA.name_first, FullName: csvDATA.full_name, State: csvDATA.state})

FOREACH (_ IN case when csvDATA.candidate = ‘trump’ then [1] else [] end|
CREATE (donor)-[donorRel:DONATED]->(candidateTrump)
)
FOREACH (_ IN case when csvDATA.candidate = ‘sanders’ then [1] else [] end|
CREATE (donor)-[donorRel:DONATED]->(candidateSanders)
)

Step three graph.

MATCH p = (d:Donor)-[r:DONATED]->()
RETURN p

By Joe Foley
Comments Comments Off on Brothers in Alms: Bernie & Trump
Category General
Time 17 Minute Read

Graphs About The People Around Us

Brothers in Alms: Bernie & Trump

Step 1: MAKING A SPREADSHEET OF THE DATA.

Step 4. GRAPH IN NEO4J

Recent Posts

Recent Comments

Archives

Categories

Meta