Over the past year or so I have been working with some friends to create a 501(c)3 non-profit organization called Anidata. Anidata has a two-prong mission: 1) we provide mentorship opportunities where aspiring data scientists can get hands-on experience with real-world data problems in order to improve their skills and 2) we provide data science services to social good organizations that might not otherwise have access to data science capabilities.
Stop Human Trafficking
In our first project as an organization, we supported the Fulton County’s District Attorney’s office to help find, fight, investigate, and prosecute human trafficking crimes. In consultation with the DA’s office, we learned that many of the investigative steps require manually search adult services websites to build a case.
After gathering this initial information, we held a hackathon with Anidata members and law enforcement/DA staff to converge on a concept for a data product that would help them fight human trafficking.
The hackathon was a big success with over 30 attendees. Teams presented various project ideas that were in varying states of code completion. The common theme for a useful product that emerged was a searchable database of internet entities that post on adult ad websites.
The basic ides would be to allow investigators to take one piece of evidence they might have on an individual under investigation–such as a phone number– and search on that information in a database for all related phone numbers and email addresses. This helps the DA more efficiently link burner phones to people and posts so that they can build a case.
At the end of the hackathon we had many elements of what would eventually be this search product in place including: 1) a webscraper, 2) a luigi job framework for running daily jobs and dumping the output to a postgresql database, 3) cleaning algorithms to find and clean emails and phone numbers, and 4) an entity resolution algorithm to group postings by entity.
Entity Resolution (ER)
The core algorithmic challenge in this project was to create an ER algorithm that would group posts by posting entity and allow us to extract a set of emails and phone numbers from those posts to associate with that entity. Let’s start by taking a look at the data.
I’ve flatten several months of the database data and pruned it down to just the
columns we care about. In the following dataframe each row is a ad posting and
the metadata we were able to scrape from that posting. Sample data includes
three flat files that pair a
post_id with an email, user ID, or email address.
We have the following columns:
- phone number
- oid (poster unique ID)
I have the data for this post stored in GitHub. Let’s take a look at the head of the dataframe:
To get a better sense of the data, we can use
DataFrame.describe to find the
has some cool canned functions for examining data distributions by category. I
am particuraly fond
sns.swarmplot that makes plots like the one below. The
plot is a sampling of the distribution of first 1000 rows by region and poster
age. Strangely, there are a few posters with ages around 100 years old.
<matplotlib.axes._subplots.AxesSubplot at 0x13d7ab8d0>
After wrestling with the data a bit, we realized that we can conceptualize the data as a graph and then cluster posts into entities by finding the connected subgraphs. To do this, our graph has the following semantics:
- Vertices: Ad posts
- Edges: Common attributes (email, phone number, or poster ID)
Our approach is to create subgraphs using each type of edge information and then use subgraph clustering to find the entity subgraphs.
Explore Subgraph Sizes
The clustered subgraphs will be entities and each entity will have posted some number of ads. The block below analyzes the count of the most posted phone numbers. These are phone numbers that the poster put in the ad as contact information. The output shows that some phone numbers are associated with over a thousand posts.
One challenge is to efficiently create the sub graphs. Our first approach was to make fully connected graph out of the data subsets. But this has the short coming of creating a bunch of graph edges that we will later show are not necessary. With a fully connected graph of \(N\) nodes, we will have \(N \cdot (N-1)\) connections, which is a lot of graph data to store and reason about.
Here is an example of a phone number that is seen on 11 posts:
Fully Connected Subgraph
To create the subgraph we iterate through all pair-wise combinations of posts (a
cross join) and populate the networkx graph. You’ll notice in the function
below that we also add an attribute to each edge called
I made a plotting library called pyd3netviz to use d3 to
visualize network graphs in python (see my previous post about it). pyd3netviz
can take networkx node an edge attributes as directives for styling the plots.
In this example, out function takes in a rgb tuple called
color and uses that
to specify the edge color.
Phone Numbers Only
The example below shows the fully connected subgraph for one phone number (the graph is interactive). Out high-level ER objective is solved as long as these nodes are connected as a group. In this case is looks like have the graph fully connected is overkill.
Email Addresses Only
Looking at a set of carefully chosen email addresses, we see the same sort of graph. I’ve chosen these email addresses because they share a common post with the phone numbers from the previous example.
Email Address Graph
Combined Graph with Email and Phone Numbers
If we combine all the posts from the previous two examples and plot them, we
notice that they are part of a common ER subgraph that contains the email
email@example.com and the phone number
see in a minute that there are other phone numbers connected to this entity as
well. In the plot below the phone number edges are blue and the email address
edges are red. You may have to drag around some of the vertices in order to see
the red edges.
Simplifying The Graph
The process above works, but having a fully connected set of graphs ends up taking a bunch of memory. To simplify the graph for our problem, we only need each network of posts to be connected–__not__ fully connected. To do that, we can create subgraphs that are loosely connected loops instead of densely connected balls of edges.
Phone Numbers Only
Phone Numbers and Email Addresses
It is clear that the graph below satisfies our constraint of keeping these nodes connected, while requiring many fewer edges.
When viewed this way, a set of connected posts (vertices) and poster attributes (edges) constitute an entity. With that clear, we simply have to create a graph out of all connections and then find the disjoint subgraphs. The function below takes one form of edge information and makes a connectivity list.
Add Graphs for Each Type of Connection
For each identifying attribute, use
make_graph to generate the connectivity
list and then combine all lists.
Use NetworkX to Find Disjoint SubGraphs
Now that we have the mast graph, we can use NetworkX to find all subgraphs and then loop through them to find each distinct entity.
Check Entity Data
To check the entities we can start by just looking at the number of posts associated with each entity id. That is, the number of posts in each disjoint subgraph.
Merge With Original Data
Finally, we use the entity dataframe to join back in the post data so that we can can see the identity data for each entity.
To check the results, let’s look at the entity we started to manually analyze in the first examples. If we search this entity table by email address we see two phone numbers attached to the entity including the one we originally experimented with. That seems reasonable.
entity_id, we can see all the connected posts. We can see a
variety of phone numbers, emails, and post locations.
Finally, we can get a sense of the network by plotting a subset of 50 entities. There are several orphan posts that aren’t connected to an entity as well as several large entities with many connected posts. A quick glance at the plot shows that most of the ID information are phone numbers (blue edges), with a few email addresses (red) and poster_ids (green) sprinkled in.
NetworkX also has a plotting tool that we can use to see the graph. The block below shows how we use nx to make the same plot.
- Talk notebook on GitHub
- Talk slides on Google Slides
- Anidata 1.x code for this Stop Human Trafficking project is on GitHub