Friday, June 22, 2012

Free data should be free!

Fancy yourself a crime mapper/ information architect/ data analyst type? You might enjoy reading this article, in which SpotCrime.com discusses the design issues of relaying crime data to citizens, and the stickier wicket of getting timely public information from a public agency that's headed by people determined to contract with some friend-of-a-friend's-friend's for-profit company instead of releasing the information via free, public, no-barrier channels. 'Splains SpotCrime:
"We only see crime data delivered openly when a public agency - not commercial company - elects to use this method.  Private crime mapping companies have an incentive to balkanize the data they are contracted to publicize (typically funded by taxpayers) and limit entities from sharing this data (the same entities who are paying taxes). They may come across as open, however, they typically add a lengthy terms of service agreement to their maps and websites - ostensibly asking each citizen to agree to a contract before viewing public crime data.  These terms of service agreements have no public value.  They not only inhibit the press from republishing the data, but also serve to restrict public sharing of crime data."

9 comments:

Cham said...

The link to the spotcrime post is here.

Spotcrime tells us what I have already learned:

We’d like to argue that once an agency has decided to reveal the data to the public, the real value of the data is the ability of the data to be shared, not to be silo'ed in one proprietary map that stifles the full power of the Internet and social media.

Spotcrime also claims that: "However, to our knowledge, we are the only company who has released a massive amount of historical data to the public at no cost, and with no restrictions. We have the shortest terms of use. Here it is: “Be nice with our data and our website.” And we don’t encourage any agency to only share with us. "

So I decided to see what I could do. First I went to Spotcrime's homepage and requested all the crime in the 21230 ZIP code. Once it generated a map I requested that the data be opened in Google Maps. That gave me this error message: "The specified directory has been deprecated." So no Google Maps.

Then I asked for the same data to be opened in Google Earth. That generated a downloadable kmz maplet openable in Google Earth. Unfortunately, the data that came into my Google Earth were shootings and thefts in the 21201 and 21202 ZIP codes, and nothing for 21230. Then I tried to take that erroneous maplet and open it up in Notepad. Instead of a nice list of geocoded data I got what looked like encrypted text.
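A note on the Notepad result: a .kmz file is a ZIP archive that wraps a KML (XML) document, so a text editor shows compressed bytes rather than readable coordinates. The sketch below is one way to pull the placemark names and coordinates out with Python's standard library. The function name `read_kmz_placemarks` is made up for illustration, and the layout of SpotCrime's actual export is an assumption; the code only relies on the generic KML element names `Placemark`, `name`, and `coordinates`.

```python
import zipfile
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip any XML namespace, so '{ns}Placemark' becomes 'Placemark'."""
    return tag.rsplit("}", 1)[-1]


def read_kmz_placemarks(kmz_source):
    """Return (name, coordinates) pairs from the first .kml file in a KMZ archive.

    kmz_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(kmz_source) as archive:
        kml_names = [n for n in archive.namelist() if n.lower().endswith(".kml")]
        if not kml_names:
            raise ValueError("no .kml file found inside the archive")
        root = ET.fromstring(archive.read(kml_names[0]))

    placemarks = []
    for element in root.iter():
        if _local(element.tag) != "Placemark":
            continue
        name = coords = ""
        # Take the first <name> and first <coordinates> found under the placemark.
        for child in element.iter():
            text = (child.text or "").strip()
            if _local(child.tag) == "name" and not name:
                name = text
            elif _local(child.tag) == "coordinates" and not coords:
                coords = text
        placemarks.append((name, coords))
    return placemarks
```

KML stores each coordinate as longitude,latitude,altitude, so the order is the reverse of what most spreadsheets expect.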

So I moved further along to find out more about this freeflow of data from Spotcrime. I found this posting from Businesswire.

So, to get access to this vast gigabyte of data (and we don't know what it includes), I found out by doing that you have to go to this page and write them a note saying that you agree to their TOS and asking them to please send you a link to this treasure trove of crime data. I did that and I'm still waiting. Maybe nobody is in the office.

I'm beginning to appreciate OpenBaltimore.

Cham said...

Moving on with this project, at 7:38AM this morning I received an email from Spotcrime asking me to sign a disclaimer regarding their data. It was mostly about how they weren't responsible for the truthiness of their data. I filled it out and returned it.

Cham said...

Phase III of the spotcrime legacy data obtainment.

I just received an email from spotcrime with 4 files that are loaded onto some sort of Amazon ftp site. They are called 2010, 2009, 2008 and 2007. 2011 data must be a little too fresh and new to give away as a freebie.

I'm downloading 2010 now. It's in tar.gz format, which is some sort of compressed file. It's 358 MB, so here in DSL world that's going to take some time. When I get it, if I get it, I'll report back. Can't wait to see what this thing contains and whether the data is shuffled. (Don't worry, I can unshuffle just about anything.)

Cham said...

Phase IV

These large tar.gz files are a PITA. It's taken me two large downloadable unzipper programs to get the 2010 data into some sort of workable csv file. It's worse than I thought it would be. What I received after the initial tar.gz file was finally unzipped using an evaluation version of WinZip 16.1 was 12 csv files, one for each month. But each file contains the data for every city that Spotcrime serves, all shuffled together.
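For anyone facing the same archive, here is a rough sketch of how the unzip-and-unshuffle step might be done in Python without installing an unzipper at all. It reads each csv inside the tar.gz as a stream and keeps only the rows for one city. It assumes each csv has a header row and a column named `city`; that column name is a guess, not something taken from the SpotCrime files, so adjust `city_column` to whatever the real header says.

```python
import csv
import io
import tarfile


def rows_for_city(archive_path, city, city_column="city"):
    """Yield rows (as dicts) for one city from every .csv inside a .tar.gz archive.

    city_column is a hypothetical header name; change it to match the real files.
    """
    wanted = city.strip().lower()
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not (member.isfile() and member.name.lower().endswith(".csv")):
                continue
            raw = archive.extractfile(member)
            # Decode as text on the fly so the whole file never sits in memory.
            text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace", newline="")
            for row in csv.DictReader(text):
                if (row.get(city_column) or "").strip().lower() == wanted:
                    yield row
```

Because the function is a generator, something like `list(rows_for_city("2010.tar.gz", "Baltimore"))` collects the matching rows, while a plain `for` loop processes them one at a time.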

But it gets worse, my friends. The complete addresses for all crime locations are located in cells that occupy just one column, so it is going to take some extra work to split the data of that column into several columns so that it can be sorted by location.
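The column split can also be scripted. The sketch below assumes, purely for illustration, that each address cell looks like "street, city, state"; the real layout of the SpotCrime column isn't described here, so treat the format and the helper name `split_address` as hypothetical.

```python
def split_address(address):
    """Split a 'street, city, state' cell into three named parts.

    Splitting from the right keeps any commas inside the street portion intact.
    Cells that do not have at least two commas are returned unsplit in 'street'.
    """
    parts = [part.strip() for part in address.rsplit(",", 2)]
    if len(parts) != 3:
        return {"street": address.strip(), "city": "", "state": ""}
    street, city, state = parts
    return {"street": street, "city": city, "state": state}
```

Once the parts are separate, sorting by location is just a sort on the `city` and `state` keys.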

This "legacy crime data" that Spotcrime claims is "free" is almost unusable, unless I put some serious effort into addressing the data challenges, and that effort may or may not be successful.

Cham said...

Phase V
I took a Spotcrime Legacy Data file at random that I had converted to csv, in this case October 2010, and managed to successfully split the cells that contained the location data. So now I have a workable file.

So in conclusion, when it comes to the question as to whether Spotcrime offers "free" legacy data the answer would be yes with lots of caveats:

1. The most recent data you'll get is December of 2010.
2. You have to agree to SpotCrime's Terms of Service AND sign a waiver, as well as deal with some anonymous person on their end who may or may not agree to give you the data.
3. You need to have an Internet connection that can handle downloading a gigabyte of data.
4. You need to store a gigabyte of zipped data and then have the skillset and software to convert it to something usable.
5. You must have a database that can handle over 583,000 rows of data and a computer powerful enough to move it. My computer has 3 GB of RAM, was recently purchased and is moving rather slowly on the project.
6. You need to be handy enough with a spreadsheet to be able to split cells.
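On caveat 5, one possible workaround for a slow spreadsheet is to stream the csv into a SQLite file and query it there, since SQLite ships with Python and handles hundreds of thousands of rows comfortably. This is only a sketch: the table name `crimes` and the function name are made up, every column is stored as text, and it assumes the csv has a header row whose names contain no double quotes.

```python
import csv
import sqlite3


def load_csv_into_sqlite(csv_path, db_path, table="crimes", batch_size=10_000):
    """Stream a csv with a header row into a SQLite table; return rows inserted."""
    inserted = 0
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if not header:
            return 0
        columns = ", ".join(f'"{name}" TEXT' for name in header)
        placeholders = ", ".join("?" for _ in header)
        insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'
        conn = sqlite3.connect(db_path)
        try:
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
            batch = []
            for row in reader:
                if len(row) != len(header):
                    continue  # skip malformed rows rather than guess at their layout
                batch.append(row)
                if len(batch) >= batch_size:
                    conn.executemany(insert_sql, batch)
                    inserted += len(batch)
                    batch.clear()
            if batch:
                conn.executemany(insert_sql, batch)
                inserted += len(batch)
            conn.commit()
        finally:
            conn.close()
    return inserted
```

Working in batches keeps memory use flat no matter how large the monthly file is.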

Spotcrime is well aware that most people wouldn't be able to do all of this. They're in business to earn money and they know most people will hand it over freely so they don't have to do what I just did.

Maurice Bradbury said...

So Cham, are you able to get data from Open Baltimore and work with it better? The thing is, too, all their data is two weeks old-- not helpful. If people are getting jacked in your neighborhood, a two-week delay for this info is not acceptable.

Half a mil for phones, but the city can't have someone dedicated to getting this info to the public in a timely way?

Cham said...

I've pulled lots of data from OpenBaltimore and have generated some handy blogposts from it here and here.

Since OpenBaltimore only concerns itself with Baltimore City, the data it contains will be site-specific, so one wouldn't have many of the challenges the Spotcrime database has. A two-week age is better than 1.5 years. I'm not defending OpenBaltimore. In an ideal world crime data shouldn't be more than 48 hours old, so people can know that 10 of their neighbors have been burglarized recently. It might help them with their decision to lock their doors and close their windows.

However, any data is better than no data at all. I feel sorry for anyone who lives in an area that isn't given some stats with which to work so they know what is going on.

Eventually, when I have the patience, I'll download all the legacy crime data from OpenBaltimore. But for application's sake, you might want to inquire with that anti-RoFo group in your neck of the woods. They're doing an OpenBaltimore-generated study on whether crime increases or decreases when one starts selling Western fries.

Cham said...

Curiosity got the better of me. Just to contrast and compare with Spotcrime I downloaded two files from OpenBaltimore.

I download in comma-separated values (csv) format because the format works well with my spreadsheet software and it can easily be converted to a txt file, which is something I need when working with mapping programs for conversion to gpx file format.

The first file I downloaded was BPD arrests. It was 11 MB, took a few minutes to download and had approximately 70,000 lines of data that encompassed all arrests between 01/01/2011 and 05/26/2012.

The second file was BPD victim-based crime data. It was 20 MB, again took a few minutes to download and had approximately 200,000 lines of data that encompassed all reported crimes between 01/01/2007 and 05/26/2012.

Both files opened effortlessly with my software. The only minor problem with the data was that both long and lat occupied one cell, and that cell would have to be split in the spreadsheet in order to accommodate any gpx conversions. But that is pretty minor.
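The same split and the gpx conversion can be sketched in a few lines of Python. The example assumes the combined cell looks like "(39.2904, -76.6122)", with latitude first; that layout is an assumption about the export, and both function names are invented for illustration. The output uses the standard GPX 1.1 waypoint element, `wpt`, with `lat` and `lon` attributes.

```python
import re
from xml.sax.saxutils import escape

# Matches "lat, lon" with optional surrounding parentheses.
_PAIR = re.compile(r"\(?\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)?")


def split_lat_lon(cell):
    """Return (lat, lon) as floats, or None if the cell is not a valid pair."""
    match = _PAIR.fullmatch(cell.strip())
    if match is None:
        return None
    lat, lon = float(match.group(1)), float(match.group(2))
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon


def to_gpx(points):
    """Build a GPX 1.1 document string from (name, lat, lon) tuples."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<gpx version="1.1" creator="example" '
        'xmlns="http://www.topografix.com/GPX/1/1">',
    ]
    for name, lat, lon in points:
        # escape() protects characters such as & and < inside the name.
        lines.append(f'  <wpt lat="{lat}" lon="{lon}"><name>{escape(name)}</name></wpt>')
    lines.append("</gpx>")
    return "\n".join(lines)
```

Rows whose cell fails to parse come back as `None`, so they can be skipped or logged before the waypoints are written.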

OpenBaltimore would be my preferred database over Spotcrime.

Stephen said...

In SpotCrime's defense, tar.gz files are a very common way to bundle and compress files on Linux web servers. The ".tar" part of the filename means that the files were bundled into a single archive, and the ".gz" part means that the archive was compressed with gzip, so the file would actually be much larger to download if it wasn't in that format.

That said, it would be nice if there was an easier way to only download data that you need (say Baltimore, MD crimes only) so the file size would be more manageable for the average user.