Chip Oglesby

An online portfolio and notebook about the future of journalism.

Tag: data relationships

The problem with more data: cliff notes edition

If you would like to read my previous post, you can find it here. It goes into more depth than this one and explains more of the transparency cycle.

Hopefully this will serve as a reference guide for those looking to post data, such as check registers, online.

All data should be machine readable

  1. The basic reason to post data online is to inform citizens.
  2. PDFs are good for scanned pages; they are bad when they’re generated from computer programs.
  3. Data is much more usable when it’s in a machine-readable format such as a CSV file. It makes it easier for developers and designers to massage data into a format they need.
  4. Extracting information from PDFs can be tedious and labor intensive. It’s easier to provide an open standard format for people to use.
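
As a rough sketch of why this matters, here is how little Python it takes to total a check register once it is a CSV file. The column names and sample rows are made up for illustration; a real export will look different.

```python
import csv
import io
from collections import defaultdict

# Hypothetical check-register export. The columns (date, vendor,
# description, amount) are assumptions, not a real district's format.
SAMPLE_CSV = """date,vendor,description,amount
2010-01-05,Office Depot,Supplies,120.50
2010-01-12,Holiday Inn,Lodging,310.00
2010-02-03,Office Depot,Supplies,79.50
"""


def total_by_vendor(csv_text):
    """Sum the amount column for each vendor in a check-register CSV."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["vendor"]] += float(row["amount"])
    return dict(totals)


if __name__ == "__main__":
    for vendor, total in sorted(total_by_vendor(SAMPLE_CSV).items()):
        print(f"{vendor}: {total:.2f}")
```

Doing the same thing with a PDF means extracting and cleaning the text first, which is exactly the tedious step described above.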

All data is dirty

  1. The main question when releasing data is always: What if we release the wrong info?
  2. Data can be incorrect. Names can be misspelled, numbers can be input wrong, descriptions can be off.
  3. When working with data, always go to the source. Double check your sources.
  4. When exporting data, choose which information will best suit your consumers’ needs. The more information you can include, the better.
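
A minimal sketch of what "double checking" can look like before a file goes online: a small script that flags rows with a missing vendor, an amount that isn't a number, or a date that doesn't exist. The column names and the date format are assumptions for this example.

```python
import csv
import io
from datetime import datetime

# Hypothetical export with deliberate mistakes: month 13, a blank
# vendor, and a letter O typed in place of a zero.
DIRTY_CSV = """date,vendor,amount
2010-01-05,Office Depot,120.50
2010-13-01,Holiday Inn,310.00
2010-02-03,,79.5O
"""


def find_problems(csv_text):
    """Return (line_number, problem) pairs for rows that look wrong."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row.get("vendor") or "").strip():
            problems.append((line_no, "missing vendor"))
        try:
            float((row.get("amount") or "").replace(",", ""))
        except ValueError:
            problems.append((line_no, "bad amount"))
        try:
            datetime.strptime((row.get("date") or "").strip(), "%Y-%m-%d")
        except ValueError:
            problems.append((line_no, "bad date"))
    return problems


if __name__ == "__main__":
    for line_no, problem in find_problems(DIRTY_CSV):
        print(f"line {line_no}: {problem}")
```

A check like this only catches values that are obviously malformed. A misspelled name or a wrong but plausible number still needs a person going back to the source.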

All data needs context

  1. Schools, government and municipalities needn’t waste time giving data context. Allow developers to take your information and do that for you.
  2. Developers need an easy way to access your information. CSV files and APIs are a developer’s best friend.
  3. Designers can work together with developers to best highlight data and make it more meaningful.
  4. If you feel compelled to give your data context, show examples highlighting your dataset. For example, how much did your school district spend per month on lodging and meals? How much has your government spent on cell phones and technology?

All data needs a central storage location

  1. Storing PDFs by month on the same page is a good start. It gives people a way to categorize things, but it is bad for computers.
  2. Data should be stored in a publicly accessible database such as Socrata.
  3. Storing your data in Socrata will centralize information, allowing for quicker, easier access to material.
  4. Databases should have methods of exporting data, such as JSON or CSV files or a REST API.
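
Once data is structured, offering more than one export format is cheap. This sketch, with made-up sample rows, turns CSV text into JSON using only Python's standard library. It is an illustration, not how Socrata or any particular database does it.

```python
import csv
import io
import json

# Hypothetical rows; the column names are assumptions.
REGISTER_CSV = """date,vendor,amount
2010-01-05,Office Depot,120.50
2010-01-12,Holiday Inn,310.00
"""


def csv_to_json(csv_text):
    """Convert CSV text to a JSON array with one object per row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


if __name__ == "__main__":
    print(csv_to_json(REGISTER_CSV))
```

Every value comes out as a string here. A real export would want to decide which columns are numbers and dates.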

All data needs action

  1. While pushing to have more information online is great, if no actions result from publishing, what good is it?
  2. Engaged citizens and advocacy groups need a way to export and share their findings through social media.

Further Reading: For more in-depth reading about data, check out these resources.

  1. Civic Commons/OpenMuni Wiki: A great resource for any municipality looking to make the leap into the digital world. Case studies, best and worst practices and more.
  2. The five stars of open linked data: The inventor of the World Wide Web, Tim Berners-Lee, explains why he wants to build a new web using linked data and what we need to do to get there.
  3. Socrata: A free and paid service for municipalities to store their data online. Their basic service is free and prices increase depending on needs.
  4. The transparency cycle: From the Sunlight Foundation. This graphic and blog post explain why we must all work together.
  5. The eight principles of open government data: Government data shall be considered open if the data are made public in a way that complies with these eight principles.

The problem with ‘more data’

Recently, SCPC wrote about the problem with online check registers in county school districts. As more and more data is placed online, we need a way to standardize data so that it has context and isn’t just sitting there. That’s what we would call ‘naked transparency.’

The naked transparency movement marries the power of network technology to the radical decline in the cost of collecting, storing, and distributing data. Its aim is to liberate that data, especially government data, so as to enable the public to process it and understand it better, or at least differently.

Before we rally the troops, we have to realize that getting more data, data that we own, from government officials on all levels doesn’t equal more transparency or accountability.

Data is only part of the Transparency Cycle

In a blog post, the Sunlight Foundation shared a very interesting graphic that shows how the ‘Transparency Cycle’ works. It has no beginning or end because it’s an ongoing process. Government agencies (the State Ethics Board, for example) are responsible for organizing data and giving web developers APIs. Developers work with graphic designers, who give data context by visualizing it. Designers work with journalists, who build public awareness by providing context and reporting anomalies. Engaged citizens work with advocacy groups, who organize and take action to hold the public and lawmakers accountable for what’s going on in government.

Tim Berners-Lee, the inventor of the World Wide Web, has envisioned a new type of web, one of linked data, where the dots are able to be connected. Berners-Lee gives five points of open linked data.

  1. make your stuff available on the web (whatever format)
  2. make it available as structured data (e.g. excel instead of image scan of a table)
  3. non-proprietary format (e.g. csv instead of excel)
  4. use URLs to identify things, so that people can point at your stuff
  5. link your data to other people’s data to provide context

State Comptroller Richard Eckstrom’s state government spending transparency site accomplishes four of the five goals, a great accomplishment in my opinion. Our school websites, on the other hand, meet only one of the five. PDFs with no structure give engaged citizens no way to ingest and analyze more than one month’s worth of data.

I was able to scrape a PDF from Berkeley County’s transparency website and run the information through Many Eyes to get the chart featured below. Ideally, there should be a simpler way for a developer or designer to visualize this information through APIs.

Eckstrom’s website faces the same type of problem. It focuses on month-to-month expenditures, so if I want to build a database, I have to download 12 separate .csv files and load them into another database before I can visualize anything.
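
To show what that extra step involves, here is a minimal sketch that stitches monthly files into one list of rows and tags each row with the month it came from. The file contents and column names are invented for the example, and I am assuming every month uses the same header.

```python
import csv
import io

# Stand-ins for two downloaded monthly files (hypothetical contents).
JANUARY = "date,vendor,amount\n2010-01-05,Office Depot,120.50\n"
FEBRUARY = (
    "date,vendor,amount\n"
    "2010-02-03,Office Depot,79.50\n"
    "2010-02-10,Holiday Inn,310.00\n"
)


def merge_monthly(files):
    """Combine (label, file-like) pairs into one list of row dicts.

    Each row gets a 'source' field so it can be traced back to its file.
    """
    merged = []
    for label, handle in files:
        for row in csv.DictReader(handle):
            row["source"] = label
            merged.append(row)
    return merged


if __name__ == "__main__":
    months = [
        ("2010-01", io.StringIO(JANUARY)),
        ("2010-02", io.StringIO(FEBRUARY)),
    ]
    for row in merge_monthly(months):
        print(row)
```

With real downloads, each `io.StringIO(...)` would be replaced by `open(path, newline="")`. A single API or one combined export would make this step unnecessary.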

All data is dirty

Once we’re able to actually collect data through publicly accessible APIs, does that necessarily mean the info is clean? Not really.

Since data input still relies on human beings, we are all prone to make mistakes. Remember the disaster of recovery.gov? There was a huge scandal because of all of the ‘ghost’ districts where money was being spent. The two main views here are simple: “It happened on purpose, Democrats are trying to steal/take our money” or “It was just a simple mistake, a slip of the finger or some congressional page didn’t know what district they were in.”

Also, if you browse the Sunlight Foundation’s transparency data portal and look for campaign contributions, names can be misspelled, and instead of using proper nouns for occupation, such as “Owner: Fast Bucks,” a donor may simply list the occupation as “store owner.”

This can lead to a few errors. It makes it hard to track who’s actually giving because a researcher will have to double check which company the donor works for to help connect the dots.
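
Some of that double checking can be assisted by software. The sketch below uses Python's standard `difflib` module to match a sloppy employer entry against a list of known names. The employer list and the 0.8 similarity cutoff are assumptions for illustration, and a vague entry like "store owner" still comes back unmatched for a human to resolve.

```python
import difflib

# Hypothetical list of employers a researcher has already verified.
KNOWN_EMPLOYERS = ["Fast Bucks", "Palmetto Hardware", "Rosewood Market"]


def normalize(text):
    """Lowercase and collapse repeated whitespace."""
    return " ".join(text.lower().split())


def match_employer(raw, known=KNOWN_EMPLOYERS, cutoff=0.8):
    """Return the closest known employer name, or None if nothing is close."""
    lookup = {normalize(name): name for name in known}
    hits = difflib.get_close_matches(
        normalize(raw), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


if __name__ == "__main__":
    for entry in ["fast  buck", "Palmeto Hardware", "store owner"]:
        print(repr(entry), "->", match_employer(entry))
```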

Data needs context

Once the data is published, it still needs context. PDFs are good for looking at a small record, but what if we want to compare values over a given year, or the past six years? How do we know when a company a lobbyist represents gives a lawmaker money for his PAC so that he may be influenced to vote a certain way?

Spending all day poring over massive amounts of information can be tedious and lead to the wrong conclusions. Instead, there should be automated processes in place that alert people via email, text or tweet when anomalies arise. Like the internet, they would work quietly in the background, but always be on.
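
As a very small sketch of what such an automated check could look like, this flags any payment that is more than five times the typical (median) payment. The threshold and the sample amounts are assumptions, and actually sending the email, text or tweet is left out.

```python
from statistics import median


def find_anomalies(amounts, factor=5.0):
    """Return the amounts that exceed `factor` times the median amount."""
    if not amounts:
        return []
    typical = median(amounts)
    if typical <= 0:
        return []
    return [amount for amount in amounts if amount > factor * typical]


if __name__ == "__main__":
    # Hypothetical monthly payments to one vendor.
    payments = [100.0, 120.0, 90.0, 110.0, 5000.0]
    for amount in find_anomalies(payments):
        print(f"Unusual payment: {amount:.2f}")
```

A flagged payment is not proof of anything. It is a prompt for a reporter or citizen to go look at the record.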

Designers and reporters also play an important role in this because they can help clarify misunderstandings someone may have.

Data doesn’t equal transparency

Once we have the data, it has been checked for accuracy and it has been given context, the transparency cycle is still not complete. Government could publish every single bit of data it has (recorded votes, transit information, GIS maps), but what good will it do if it just sits there?

It’s up to engaged citizens and Advocacy Groups to take the information from Designers, Developers, Journalists and Bloggers and form grassroots movements to hold government responsible. Data without action is all for naught.

Once Citizens and Groups organize and take action, they along with others can work with Lawmakers to actually make a change.

Transparency alone will not lead to more accountability in government. Data.gov and recovery.gov are great examples: the federal government has given citizens monitoring tools.

In South Carolina, we face battles of our own. The South Carolina Senate, composed of only 46 people, cannot simply decide whether it will vote on the record, because members say it’s unconstitutional. They’ve also argued that verbal roll-call voting takes too long, and I agree, it does. But there are solutions out there. Open-source software can be written so that bills, amendments and earmarks are posted online 72 hours early for the public to inspect. House and Senate members could then vote on the bills, and we could connect the dots to see where change and influence are happening.

The question that South Carolina faces is: Who’s going to be first in the Transparency Cycle?

Retaining institutional knowledge

In our October SMC meeting Doug Fisher made a great point when he said “What other business leaves 99% of their raw material on the cutting room floor?”

That quote has given me plenty to think about when it comes to retaining ‘institutional knowledge’ in the workroom.

Managing information

Newspapers have always been plagued by how they manage information. Most reporters use notepads and keep them tied up in boxes shoved away under their desks.

There may be some online content stored in individual reporters’ directories, but there’s no central repository of information available. At best, some reporters store all of their contacts in an MS Word document. *ugh*

With staff layoffs, most mid-sized newspapers have completely done away with their library staff, opting instead for a digital library such as Olive.

But what happens to the institutional knowledge when layoffs come? Think of all the history that a reporter takes with them. Every contact they have, every note they’ve taken may just as well walk out of the door with them.

Suggested software

So how do we tackle this mountain of data and inefficiency?

To begin with, reporters, editors and producers need to understand that their knowledge belongs to everyone in the newsroom. I know that some may find it shocking when you ask them to share their sources with others, but it’s time to stop playing this game and start collaborating as a team.

Next, newspapers should install and internally host their own free wiki site.

Within those pages, reporters and editors can create information-rich pages about every prominent business, councilman, elected official, high school and sports team they cover.

Take, for example, the public Wikipedia page of South Carolina Governor Mark Sanford. This could easily be duplicated for newspapers and could include twice as much information because every story we and everyone else have published would be linked to this page. It could also include contact information, known associates, political positions, campaign donors or whatever you could imagine.

Every time a reporter or editor learns new information about an individual subject, it could be added to the wiki to help retain that much-needed knowledge and context. Over time a huge database could be created and, using APIs and metadata, it could also be connected to your photo archives and digital libraries, giving you the ability to do some great data mining.

A reporter could easily maintain their own pages by creating entries for people they cover like the mayor, the governor, head football coach, whoever and adding small nuggets of information over time. If the need arises or beats are swapped, then all of their knowledge moves right on to the next person who covers that beat.

The type of information that is retained could vary, but newspapers could create a guideline for what should be kept on the wiki pages. If working on something confidential, a team could password-protect their page, but this would be outside of the norm since we want everyone to collaborate.

All of this comes down to newspapers needing to curate raw data and give it a place to reside for long-term use. Newspapers need to do a much better job of connecting the dots internally as well as externally.

Issues to consider

One thing to take into consideration is where you want to host this wiki. Do you want to store it on the internet and allow your readers to collaborate with you or do you want to store it internally behind a firewall available only from within the building and via VPN?

If there were a way to password protect certain parts of the page, I would make the bold move to suggest that it be publicly available and ask your readers to contribute their collective knowledge. Obviously there would still be a need to fact check everything that readers post.

Another issue to consider is how to get reporters and editors excited about doing something like this. There are certain types of people (like me) who could sit around and semantically tag blogs and multimedia all day, and then there are others who are lucky if they even check their work emails once a week.

Sometimes you’ll see a strong push for something exciting like this in the very beginning, like writing a company blog, but it slowly tapers off over time, so keeping folks interested will also be an obstacle.

The bottom line

As more papers face more cutbacks and layoffs, our ‘institutional knowledge’ is going to keep on walking out the door at an alarming rate.

Setting up an internal wiki is only the beginning for what could be accomplished. With some basic software and data mining, reporters and editors could uncover a completely new set of data that will give their site premium content, but connecting the dots has to start somewhere. Where do we go from here?

Using data and augmented reality to help define local news

There is no longer any denying that the use of what we currently call “smartphones” will only continue to increase as the technology becomes cheaper and more capable.

The way that we use our phones will also continue to change as more phones utilize what is known as Location Based Services, or LBS, which uses various methods of A-GPS.

This is a pretty new area for newspapers to start exploring and I would like to see more attention paid to local advertising using LBS.

I recently saw an article that described an Augmented Reality app for Android phones that shows nearby tweets and various other types of information. Two examples are Wikitude (Android) and TwitAround (iPhone).

The basic idea of TwitAround is that by using the phone’s accelerometer you can see real-time tweets happening around you.

We also know that data needs relationships and newspapers are historically good about gathering data. What they are not good at is how they record and distribute that information.

My idea is to build an application that harnesses all of this data and makes it available on your phone.

Examples

Example 1: You are a first time home buyer looking in the Rosewood area on Maple for a home. By simply pointing your phone at a home, you are instantly able to see MLS listings, tax parcel service look ups and average utility usage charges. You are also able to see local related stories, photos, tweets, video, crime stats and so forth.

Example 2: You are the same home buyer and you travel to the intersection of Wheat and Rosewood and come upon Hand Middle School, which your children may attend. By pointing your phone at the school, you are able to see publicly accessible data such as SAT scores, teachers’ salaries, crime reports, stories about the school, historical context and more.

Example 3: You are at a high school football game where Hammond is playing Heathwood Hall. By pointing your phone at a jersey on the field, you would be able to see team roster, individual stats, results in various weather conditions, past games, photos, videos and tweets.

Example 4: You are at the museum of art and want to know more about the painting you are looking at. By pointing your phone, you are able to see historical context, the painter’s bio, similar paintings and more.

A business model

In a virtual interview that I did with Dan Conover, I found this quote to be interesting:

“The issue with augmented reality, then, isn’t the technology. You need a platform that communicates it, a system that structures and creates it, a business model that understands its value and how to communicate it, and user devices and software agents that accurately interpret and negotiate it. The issue is content and how to pay for it.”

The problem is that we need a business model that rewards someone for adding value (i.e., meaningful content that people actually want). Until that happens, then every business that approaches augmented reality is going to treat it as just another way of delivering no-cost crap. It’s going to be mass-media executives trying to figure out how to use Facebook all over again. Business people tend to look at networked media as a way to make free money off of somebody else’s content, but there’s not going to be a sustainable business here until we work out the connections and expectations and exchanges.

While what Dan is saying is correct, I don’t think that it will be an entire ‘crap in, crap out’ model either. Just as Twitter has become popular, so will its ability to filter tweets through geolocation.

What we need is a better way to rate and log information through various algorithms that will sort the good from the bad. Part of the connections that we need to work out will be taking and filtering raw data as Berners-Lee suggested, but also pulling content from our own archives and making that available through various APIs.

Mindy McAdams also raises an interesting point in her post ‘Augmented Reality: a business model.’

Each view of a node can be tracked. Each visit to the node can be tabulated. I think the opportunities for selling would be fantastic — the whole process could be automated. The advertiser pays a small fee to have the privilege of viewing all visits to a node. This is like micro-metrics for local businesses. The fee is necessary because you want it to be monthly or yearly, and you want it tied to a true identity. The account can be modified to allow advertisers to input and update their own coupons, etc. Then they pay per ad, per length of time, per update, etc. But it’s all hands-free for the entity that owns the app.

Not only would this tie in well with local advertisers, it would also open an entirely new stream of revenue we haven’t previously seen. It’s hard to answer the question of “how are we going to make money off of this?” because we’ve never done it before. The closest thing we’ve ever had to this would be a ‘bar database.’

Drawbacks

There are some drawbacks to LBS:

Results indicate that A-GPS locations obtained using the 3G iPhone are much less accurate than those from regular autonomous GPS units (average median error of 8 m for ten 20-minute field tests) but appear sufficient for most Location Based Services (LBS). WiFi locations using the 3G iPhone are much less accurate (median error of 74 m for 58 observations) and fail to meet the published accuracy specifications.

but that’s something we’ll have to address in another post.

Steps to getting started

1. Your data will have to be available in a raw format. Hopefully, you’ll be able to use the COPE method, or the more controversial hNews, for your information.
2. Your data will have to be given relationships and linked to other data.
3. Your data will have to be given a specific longitude and latitude for future reference.
4. You can build your own publishing platform or you can use openly available APIs like Layar.
5. All of your photos and stories will require stronger semantic data. No more incomplete information.
6. You’ll have to actually have a team who can code all of this for you.
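
To make the longitude and latitude step concrete, here is a minimal sketch of the kind of lookup a team would need: given the phone's position, find the stories tagged within a certain distance. It uses the standard haversine formula. The story titles, coordinates and 100 m radius are invented for illustration, and this is not how Layar or any other platform does it.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby(items, lat, lon, radius_m=100):
    """Return the items whose coordinates fall within radius_m of the point."""
    return [
        item
        for item in items
        if haversine_m(lat, lon, item["lat"], item["lon"]) <= radius_m
    ]


# Hypothetical geotagged stories; the coordinates are made up.
STORIES = [
    {"title": "School board vote", "lat": 34.0000, "lon": -81.0000},
    {"title": "Football roundup", "lat": 34.0005, "lon": -81.0000},
    {"title": "Museum exhibit", "lat": 34.0100, "lon": -81.0000},
]

if __name__ == "__main__":
    for story in nearby(STORIES, 34.0000, -81.0000):
        print(story["title"])
```

Given the accuracy figures quoted under Drawbacks, a search radius much tighter than the phone's positioning error probably wouldn't make sense.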

Conclusion

Where we go from here really depends on how much news organizations want to invest in this type of technology. At the very least, we can take small steps by adding value to our stories through our Content Management System by using keywords and physical locations if they support it. (Hint: MNI does!)