2019:Quality/Data Quality in Wikidata
This is an Accepted submission for the Quality space at Wikimania 2019.
Description
Within a few years, Wikidata has developed into a central knowledge base for structured data through the collaborative efforts of its peer production community. One of the benefits of peer production is that knowledge is curated and maintained by a wide range of editors with different cultural, experiential and educational backgrounds, which hopefully results in fewer biases and a more diverse knowledge base.
Ensuring data quality is thus of utmost importance: the goal of Wikidata is to “give more people more access to knowledge”, and therefore the data needs to be “fit for use by data consumers” (Wang et al., 1996). The Wikidata community has already developed methods and tools that monitor relative completeness (e.g., the Recoin gadget; Balaraman et al., 2018), encourage link validation and correction (e.g., Mix’N’Match) and help editors observe recent changes and identify vandalism. Moreover, the community started global discussions about relevant dimensions of data quality in a recent RFC. The RFC used a survey of Linked Data Quality methods as the debate’s starting point to better describe and categorize quality issues and to add more quality aspects and dimensions, with the goal of developing a data quality framework for Wikidata. Despite this progress, recent research has shown the dominant role of a Western perspective in the represented languages (Kaffee and Simperl, 2018); more work is needed to achieve greater knowledge diversity. It is therefore a major concern of data quality to support such knowledge diversity and ensure that Wikidata covers a wide variety of topics, from various trustworthy sources, where facts can be contradictory.
In this talk, we would like to present a classification of existing tools for data quality monitoring and data quality assurance in the context of Wikidata (extending previous work), drawing the Wikimedia community’s attention to gaps and opportunities for editors and developers to improve the collaborative data management cycle. Additionally, we will provide a comparison of data quality management strategies in Wikidata and Wikipedia, and present a summary of scientific findings relevant to the topic.
Relationship to the theme
This session will address the conference theme (Wikimedia, Free Knowledge and the Sustainable Development Goals) in the following manner:
Data quality can be interpreted as a topic orthogonal to the SDGs listed in the conference’s theme. Data in any of the SDG domains needs to be of high quality in order to be consumed and used, whether the purpose is to analyse gender inequality via Wikidata item descriptions (SDG 5 Gender Equality), implement apps that help people gain knowledge in a specific domain (SDG 4 Education), or visualize a map to show the evolution of the climate disaster (SDG 13 Climate Action).
Session outcomes
At the end of the session, the following will have been achieved:
Attendees will learn about the status quo of data quality in Wikidata, and we will define (hopefully together, in a Q&A slot) a roadmap of concrete edit types, MediaWiki features and external tools to be developed in the near future.
Session leader(s)
Cristina Sarasua (Username:Criscod), Universität Zürich sarasua@ifi.uzh.ch
Mariam Farda-Sarbas (Username:Mariamfs), Freie Universität Berlin mariam@zedat.fu-berlin.de
Claudia Müller-Birn (Username:Claudiamuellerbirn), Freie Universität Berlin clmb@inf.fu-berlin.de
Lydia Pintscher (Username:Lydia_Pintscher_(WMDE)), Wikimedia Deutschland lydia.pintscher@wikimedia.de
Session type
Each Space at Wikimania 2019 will have specific format requests. The program design prioritises submissions which are future-oriented and directly engage the audience. The format of this submission is a:
- Discussion-based training workshop
Requirements
The session will work best with these conditions:
- Room:
Small classroom / lecture hall or round-table seating (with projector).
- Audience:
30 - 50 people
A basic understanding of Wikidata’s data model and workflow would be desirable. However, we would like to be inclusive and invite people from other related Wikimedia projects. So, time permitting, we could provide a flash introduction to what people need to know before we dive into the details of data quality in Wikidata.
- Recording:
We agree to having this talk recorded and shared under a free license.
Slides
Slides can be found here:
Transcription of post-its gathered in the poster session and in this talk/workshop
Poster
Question 1: What quality dimension should we better support or what task should be (more) automated?
- Fit for purpose
- How can we handle situations where Wikidata has actual better data than the “trusted” source?
- How to easily find if a P date already exist so that I don’t create a duplicate?
- Feature N importance?
- Changing Wikidata values … when editing Wikipedia article infobox
- Shape Expressions//Entity schemas
- How can I communicate the trust of a source across Wikipedia languages?
- Editathons for subject matter experts in field X who can translate labels for language Y (need user friendly tools)
- More transparency and easy-to-understand explanations that allow people, students to use the data
Question 2: On which data quality dimension do you work?
- Identifiers interlinking
- Input from Wikipedians very welcome
- What’s complete depends on the context/ use case / community
- Gender perspective in Wikidata. Classify by gender
- Completeness because it is easy to tackle.
- Data Quality is often defined outside WD
- Ethical dimension of Google using the data
- {similarity of items sic.} John Adams! Example -> additional data needed
Question 3: On which data quality dimension would you like to work in the future?
- Diversity
- Interlinking (value gained for source)
- References needed for conflicting data
- External references (need to replace the internal ones)
- MORE TRUST
- External Identifiers
- Quality ranks for sources
Talk / Workshop Input
Q1: What critical data quality issues did you spot in Wikidata?
- One topic can have multiple Wikidata entries in different languages.
- Descriptions in different languages translate differently.
- There are still old manual interwiki translations left in Wikidata.
- Better support for differing (and maybe even sourced) data values (like birthdate etc.).
- First showcase item I checked to get help was wrong → control showcase items more
- Not every entry has or even needs a reference.
- Wikipedians cannot create new / missing properties.
- Some questions about Wikidata are not answered in weeks.
- Data changes in Wikidata are often not seen by Wikipedians.
- Completeness and references.
- How to know when an item is properly complete.
- Duplication, lack of reference, data from Wikipedia.
- Different topics can have the same Wikidata item. Need to be separated.
- Sources.
- Url reference without date visited qualification.
- Wrong ontologies.
- Wrong type of instances or subClassOf.
- Duplicates.
- The major quality concern or unused potential of Wikidata design, is that many claims are unsourced.
- Confusion about properties pertaining to types of items (e.g., “location” or “located in the administrative entity”): usage is unclear to the average Wikidata editor / contributor.
- Make it easier to do multiple Q references for the same source, instead of having to copy/paste repeatedly.
- Gaps in subclass tree.
- Inconsistent ontology / property usage.
- BLP problems.
- Incorrect or inaccurate subclasses being used.
- Use of wrong properties / inconsistency.
- Wikiproject standards hidden.
- Lack of statements.
- Bad language mappings / false friends.
- Lack of references.
- Results of search seem to be overtaken by growing body of article citations; make it easier to filter results appropriate to search.
- Missing statements → automatically suggest a range of statements when type of item is chosen.
- People sometimes don’t understand complex constraints.
- Old imports from Wikipedia.
- Too many items still have no statements; they are difficult to improve or merge.
Q2: How should we organize data quality management in the community?
- Removing references if value has been modified.
- DQ Management better procedure.
- Tools to check duplicates.
- Tag the quality issue in the UI.
- Ask for more references.
- Introduce quality levels for data:
- Valid reference?
- Who brought it in?
- All quality dimensions respected?
- Give good sight to relevant changes (for articles on my list).
- More automatic processes.
- Mediators between the contents and substance community and the technical community.
- Data quality night: more “canned” templates for common topics, with likely Q’s / properties
- Have a way to add a check/mark or something like that for verification of data Q’s by editor who didn’t make original Q / reference (could be automated verification maybe?)
- Measurement / queries tools to help spot weak quality e.g., bad quality image (low definition).
- Suggest referencing unreferenced statements through a list.
- “Gamify” by adding a tab where simple fixing tasks are listed (“todo” list).
- Create “wiki love” campaigns specific for data.
- Custom editing interfaces for special projects (art, wikicite, biographies).
- Concentrate more topical / thematic discussions.
- Standard queries and dashboard to spot bad data.
- Wikiproject banners more prominent in item new.
- More tools.
- More specific metrics.
- Address problems and point them out, to work more collaboratively.
- Statement specific talk / spaces.
- Coordinate a way for editors to show what you want comments on.
- Anti-vandalism bots
- Better coordination and discovery of project discussions.
- Promote the development of data models (findable!). Discuss quality and data models in specific wikiprojects; they need more advertisement.
- Improve multilingualism also in village pumps. There are too many discussions in the project chat, which makes it difficult to follow.
- (Not familiar enough with the Wikidata community to give feedback).
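Several of the suggestions above (listing unreferenced statements, standard queries to spot bad data, more specific metrics) could be prototyped with very little code. The following is only a minimal sketch, not an existing tool: it assumes an item is available as a dictionary in the layout of Wikidata's entity JSON, where `claims` maps property IDs to lists of statements and each statement may carry a `references` list. The function names and the sample item `Q0` are hypothetical and made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def unreferenced_statements(entity: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (property_id, statement_id) pairs for statements with no references."""
    missing = []
    for property_id, statements in entity.get("claims", {}).items():
        for statement in statements:
            # A missing or empty "references" list counts as unreferenced.
            if not statement.get("references"):
                missing.append((property_id, statement.get("id", "")))
    return missing


def reference_coverage(entity: Dict[str, Any]) -> float:
    """Share of statements on the item that have at least one reference."""
    total = sum(len(stmts) for stmts in entity.get("claims", {}).values())
    if total == 0:
        return 0.0  # Assumption: an item without statements gets coverage 0.
    return 1 - len(unreferenced_statements(entity)) / total


# Hypothetical sample item (made-up IDs), reduced to the fields used above.
sample_item = {
    "id": "Q0",
    "claims": {
        "P31": [{"id": "Q0$a"}],
        "P569": [
            {"id": "Q0$b", "references": [{"snaks": {}}]},
            {"id": "Q0$c", "references": []},
        ],
        "P19": [{"id": "Q0$d", "references": [{"snaks": {}}]}],
    },
}

todo_list = unreferenced_statements(sample_item)
coverage = reference_coverage(sample_item)
```

A list such as `todo_list` could feed the kind of “todo” tab proposed above, and `coverage` is one possible (deliberately crude) metric: it says nothing about whether the references themselves are valid.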
Q3: What did we learn from Wikipedia and/or other WM projects?
- Keep people from different Wikiprojects involved and improve their understanding of the importance of Wikidata.
- It’s better to source statements from the beginning and not after a lot of time
- Problems of data modelling (e.g., a library should have 2 items: the institution and the building) should be solved quickly in order to avoid confusion in later work and edits.
- It’s possible to have a multilanguage and multicultural project (see for example Wikimedia Commons).
- Don’t rely on the information in other Wikipedias / projects.
- Show Wikipedians how they can profit from Wikidata.
- When I find a wrong fact, it is difficult to correct it. Result: I leave the wrong information at wiki.
- Don’t close things down, protect things too quickly.
- Be more open to institutions and “role accounts”
- Keep the community fun and innovative, don’t ossify.
- Inclusionism is a virtue.
- Be user friendly (e.g., create an equivalent of VisualEditor on Wikidata -- suggest list of statements, etc.)
- Talk pages don’t get used much, it seems.
- Structured data is a good thing! (across wiki projects).
- Better linking between related projects across different Wikimedia projects.
- Outreach in Education → more Wikidata Use in University Courses + Data Science.
- We have to work / learn a little bit more from each part of Wikipedia. E.g., developers, admin, user → feel every perspective.
- Like Wikimedia is one common language, utopic feeling, many people, many languages, and *one* goal.
- Global and local collaborative working.
- Write now, cite later.
- The path from discovering a mistake to correcting it must be short and easy.
- Maybe like Wikipedia, Wikidata will work in practice, but not in theory.
- Be more clear on *where* discussion should take place.
- Translating items into more languages improves quality.
- That different camps (e.g., inclusionists vs. deletionists) are not opposites but complementary.
- Bottom-up via communities of interest → so, give it space to organise itself, don’t “organise it”. Bottom-up is the Wiki USP, don’t lose it!
- Referencing is [...] complicated.