Overview
Teaching: 0 min Exercises: 60 minQuestions
Objectives
In this exercise we will explore data in QGIS and do more complex spatial queries.
We will start by interfacing the crimes dataset with census tracts. For this we will need the census tract data that includes geospatial encoding (vector polygons) so we can import these into a GIS.
SELECT c2.id, c1.population, c1.popdensity, c1.vacant_housing, c2.geom FROM
(SELECT "Census Tract"::float/100 AS CT,
"Total Population, 2010" AS population,
"Persons per Square Mile, 2010" AS popdensity,
"Vacant Housing Units, 2010" AS vacant_housing
FROM census_data) AS c1,
(SELECT id, name10::float AS CT, geom FROM census_tracts) AS c2
WHERE c1.CT = c2.CT;
A common task is dealing with incomplete or inaccurate data. For example, the stated location of a crime may not perfectly overlap with its address, so we need to do a distance-based query rather than checking for perfect overlap.
Spatial SQL also lets you ask complicated questions about how data are related. In this document, we will show some questions and how they can be answered with spatial SQL queries.
We often have to ask many-to-many questions in spatial queries, relating, for example, an entire table’s geometries to another table’s geometries. As you might imagine, by default it takes a very long time (M x N results for T1 with M rows and T2 with N rows).
Spatial indices can speed up these queries, however. The basic principle of a spatial index is that a new table (actually an index) is made knows how the geometries are ordered on the X direction as well as the Y direction, and can therefore make simplifying assumptions about when to stop searching for, e.g., the closest point.
http://revenant.ca/www/postgis/workshop/indexing.html
CREATE INDEX seattlecrimeincidents_gix ON seattlecrimeincidents USING GIST (geom);
SELECT ST_Distance(sc.geom, ST_SetSRID(ST_MakePoint(-122.3072131, 47.6212378), 4326))
FROM seattlecrimeincidents sc
WHERE "Offense Type" = 'THEFT-OTH'
ORDER BY sc.geom <-> ST_SetSRID(ST_MakePoint(-122.3072131, 47.6212378), 4326)
LIMIT 5;
If you’ve used PostGIS distance queries before, you might be wondering what’s up with that <-> thing. That’s a distance operator, roughly equivalent to ST_Distance, but is “index-assisted” and is fast in an ORDER BY clause.
More info: http://postgis.net/docs/manual-2.2/geometry_distance_knn.html
Joins! You can use JOIN to make a query that combines info between tables. Often, the tables being joined are defined under separate names+schemas, but you can also join on subqueries, as the output of a SELECT is still a table (algebra of SQL!).
You may have previously seen left/right joins, where you can combine tables based on something like a shared attribute. This implies a 1-to-1 relationship between the two tables. In the following examples, we are more interested in doing many-to-many comparisons, which is facilitated well by a CROSS JOIN.
A CROSS JOIN generates the cartesian product of the two tables, so e.g. for T1 CROSS T2, row 1 in T1 will be compared to every row in T2, and then the same will happen for row 2, and so on.
SELECT sc1.*,
ST_Distance(sc1.geom, sc2.geom)
FROM (SELECT *
FROM seattlecrimeincidents
WHERE "Offense Type" = 'THEFT-OTH') sc1
CROSS JOIN (SELECT *
FROM seattlecrimeincidents
WHERE "Offense Type" = 'TRESPASS') sc2
LIMIT 5
Note: you would remove the last ‘LIMIT 5’ to get the full result. You could also add another ‘WHERE’ clause at the end to filter the result to a more specific question.
We could also do this for a set of points (e.g. another table). Let’s say we’re interested in knowing how close together thefts and trespasses are reported.
This will involve the use of a new feature as of Postgres 9.3: Lateral joins. In previous versions / other databases, you would use a stored procedure (basically a function) that finds the closest trespass for a given point, but the lateral joins lets you do it on the fly.
First, let’s just look at the query we would run if there was only one trespass event that we cared about:
// This won't run - sc1 is not defined
SELECT geom
FROM seattlecrimeincidents
WHERE "Offense Type" = 'TRESPASS'
ORDER BY sc1.geom <-> geom
LIMIT 1
This query can be turned into a subquery in a lateral join such that it runs on every row in sc1.
SELECT sc1.*,
ST_Distance(sc1.geom, sc2.geom)
FROM (SELECT *
FROM seattlecrimeincidents
WHERE "Offense Type" = 'THEFT-OTH') sc1
CROSS JOIN LATERAL ( SELECT geom
FROM seattlecrimeincidents
WHERE "Offense Type" = 'TRESPASS'
ORDER BY sc1.geom <-> geom
LIMIT 1) sc2
LIMIT 5
Note: you would remove the last ‘LIMIT 5’ to get the full result.
Maybe we want to use the numbers we got in the last query in an analysis. Do those events co-occur (or at least, occur in the same vicinity) more often than average?
SELECT sc1.*,
ST_Distance(sc1.geom, sc2.geom)
FROM (SELECT *
FROM seattlecrimeincidents
WHERE "Offense Type" = 'THEFT-OTH') sc1
CROSS JOIN LATERAL ( SELECT geom
FROM seattlecrimeincidents
WHERE "Offense Type" != 'THEFT-OTH'
ORDER BY sc1.geom <-> geom
LIMIT 1) sc2
LIMIT 5
Aggregation functions are a useful way to summarize columsn. For example, we could take our previous results and compute averages:
// Should be ~.003005
SELECT AVG(ST_Distance(sc1.geom, sc2.geom))
FROM (SELECT *
FROM seattlecrimeincidents
WHERE "Offense Type" = 'THEFT-OTH') sc1
CROSS JOIN LATERAL ( SELECT geom
FROM seattlecrimeincidents
WHERE "Offense Type" = 'TRESPASS'
ORDER BY sc1.geom <-> geom
LIMIT 1) sc2
// Should be ~.0001877
SELECT AVG(ST_Distance(sc1.geom, sc2.geom))
FROM (SELECT *
FROM seattlecrimeincidents
WHERE "Offense Type" = 'THEFT-OTH') sc1
CROSS JOIN LATERAL ( SELECT geom
FROM seattlecrimeincidents
WHERE "Offense Type" != 'THEFT-OTH'
ORDER BY sc1.geom <-> geom
LIMIT 1) sc2
Okay, well that’s pretty good. We can compute averages, but that glosses over a lot of details.
But since we’re using SQL, we can ask a more specific question in the same vein: for every trespassing event, what is its closest theft and its closest crime in general? That will give us a dataset for which we could explore a distribution of phenomena, much more interesting!
SELECT sc1.*,
ST_Distance(sc1.geom, sc3.geom) / NULLIF(ST_Distance(sc1.geom, sc2.geom))
FROM (SELECT *
FROM seattlecrimeincidents
WHERE "Offense Type" = 'THEFT-OTH') sc1
CROSS JOIN LATERAL ( SELECT geom
FROM seattlecrimeincidents
WHERE "Offense Type" = 'TRESPASS'
ORDER BY sc1.geom <-> geom
LIMIT 1) sc2
CROSS JOIN LATERAL ( SELECT geom
FROM seattlecrimeincidents
WHERE "Offense Type" != 'THEFT-OTH'
ORDER BY sc1.geom <-> geom
LIMIT 1) sc3
We will want a snippet of data for the next lesson. This will involve running an SQL query, then saving it with QGIS.
In QGIS, connect to the database with the seattlecrimes dataset (if not already done).
In QGIS, click (in menu) Database > DB manager, select the dssg2016 database, and open the SQL Window using File > SQL Window, clicking its icon, or hitting ‘F2’.
Input this into the SQL Query form:
SELECT * FROM seattlecrimeincidents LIMIT 100;
Then click ‘Execute’. By default, we get back the result as a table.
To save the query’s result to a QGIS layer, check ‘Load as new layer’, choose ‘gid’ as the ‘Column with unique integer values’, ‘geom’ as the ‘Geometry Column’, and ‘first100’ as the Layer name (prefix).
Click ‘Load now!’
The should now be a new layer in your QGIS session. Note that it has not been saved to file yet.
Export by right clicking on the layer (‘first100’), clicking ‘Save as’, and choosing ‘Format’ to be GeoJSON.
To load a file, click the ‘plus’ button on the top left, then click the top (upload) option in the submenu.
Right click on the first100 layer and click ‘view extent of layers’
Buffer takes any shape (point, line, or polygon) and creates a new polygon that is basically an outline of the input shape, but ‘buffered’ out by some distance.
Click ‘Buffer’ in dropchop.io with your first100 layer selected.
Select ‘500’ for distance, and change the unit to meters.
Click Execute.
Uncheck ‘first100’ to make it invisible
Click ‘Center’ and hit ‘Execute’. This calculates the ‘center of mass’ of the overall shape.
Click ‘Centroid’ and hit execute. This calculates another form of ‘center’.
Click ‘Envelope’. This generates a minimal bounding rectangle of your geometries.
This is not well-supported in Dropchop.io - chalk talk
Key Points