Corporate data is available from various authoritative and non-authoritative public and private sources, such as state registers, the private sector, and open data.
Collecting, cleaning, and joining data from these sources is a difficult and expensive task. We base this assessment on the following observations:
- Not all jurisdictions collect the same data. Some collect only the bare minimum: the company’s ID, name, and address. Others also include financial statements, indexes of responsible persons, …
- Data properties can have very different meanings in different data sources, for example the name of a company’s legal status.
- There is no uniform way to identify the same companies across various jurisdictions.
- Jurisdictions may differ in their attitudes and laws regarding data privacy protection.
- Registers from different regions mostly store data in local languages.
- Data entries can be of poor quality and lack proper documentation.
Current data
To start building the business graph and the accompanying analytics services, we began by collecting data from various public and open data sources. Some sources, such as Companies House in the UK, contain complete, high-quality data, and some European public registers offer official data for bulk download. We also scrape web data from financial and business news sources, online retailers, employee indexes, and others.
The data collected so far comprises several million company records from the UK, Norway, Belgium, France, India, the U.S. and Australia. We also continuously collect information about trademarks and products related to the companies we cover.
Challenges
Each source needs its own parsing software. Data comes in different formats: mostly CSV, but also JSON and XML text files. HTML web pages are unstructured and messy to deal with. We employ automatic and semi-automatic approaches, first to retrieve the data and then to extract and structure the data that is key to our purposes.
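As an illustration of per-format parsing, the following minimal sketch maps CSV, JSON, and XML text to a list of plain dictionaries using only the Python standard library. The function names, the `record_tag` default of `company`, and the sample field names are assumptions made for this example; they do not describe the parsers actually used for any particular register.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_csv(text):
    """Return one dict per CSV row, keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_json(text):
    """Return the decoded JSON document (assumed to be an array of objects)."""
    return json.loads(text)


def parse_xml(text, record_tag="company"):
    """Return one dict per <record_tag> element, built from its child elements."""
    root = ET.fromstring(text)
    return [
        {child.tag: (child.text or "").strip() for child in element}
        for element in root.iter(record_tag)
    ]


# Hypothetical registry: one parser per supported text format.
PARSERS = {"csv": parse_csv, "json": parse_json, "xml": parse_xml}


def parse(text, fmt):
    """Dispatch to the parser registered for the given format name."""
    try:
        parser = PARSERS[fmt]
    except KeyError:
        raise ValueError(f"no parser registered for format {fmt!r}") from None
    return parser(text)
```

Unstructured HTML pages are deliberately left out of this sketch, since extracting records from them usually needs source-specific rules.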
Once extraction is complete, the data can be stored in an internal uniform format. A minimum set of attributes common to all company data sources needs to be defined. Importantly, these attributes need to be present, or meaningfully linkable, across all data sources. To easily match companies across jurisdictions, common shared identifiers for companies in Europe will need to be defined.
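One possible shape for such a uniform format is sketched below. The attribute set (ID, name, address) follows the "bare minimum" mentioned earlier. Everything else is an assumption for illustration: the source names `source_a` and `source_b`, their column names, and the `global_id` scheme that prefixes the local ID with a jurisdiction code. None of these is a defined standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompanyRecord:
    """Minimum attribute set assumed to exist in every source."""
    jurisdiction: str  # assumed convention: a two-letter country code
    local_id: str      # identifier assigned by the source register
    name: str
    address: str

    @property
    def global_id(self):
        # Hypothetical shared identifier: jurisdiction code plus local ID.
        return f"{self.jurisdiction}-{self.local_id}"


# Hypothetical per-source mapping: uniform attribute -> source column name.
FIELD_MAPS = {
    "source_a": {"local_id": "CompanyNumber", "name": "CompanyName", "address": "RegAddress"},
    "source_b": {"local_id": "org_id", "name": "org_name", "address": "org_address"},
}


def to_record(raw, source, jurisdiction):
    """Convert one parsed source row into a CompanyRecord."""
    mapping = FIELD_MAPS[source]
    missing = [column for column in mapping.values() if not raw.get(column)]
    if missing:
        raise ValueError(f"{source}: missing required columns {missing}")
    values = {attr: raw[column].strip() for attr, column in mapping.items()}
    return CompanyRecord(jurisdiction=jurisdiction, **values)
```

Rejecting rows with missing required attributes, rather than filling in defaults, keeps incomplete entries from silently entering the uniform store.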
Multi-lingual data
Data in most state registers is stored in the official language of the jurisdiction, so understanding what each attribute means and how it compares to attributes in other data collections is crucial. To analyze data from the web and various global news sources, we will use JSI’s Wikifier, which can detect organizations, people, and products in documents in different languages.
News data
News sources from across the world power JSI’s Event Registry, which we use to identify and analyze events related to specific companies. The collected articles are pre-processed to identify the entities and concepts they mention. Date references in the articles are used to determine the date of the events.
A clustering method groups similar articles together. Cross-lingual document linking techniques identify articles in different languages that cover the same newsworthy story. From the articles in each event, we then extract information about the time and place of the event, as well as relevant entities such as companies or people in the public eye.
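To make the idea of grouping similar articles concrete, here is a minimal single-language sketch. It uses greedy threshold clustering with Jaccard similarity over word sets. This technique is chosen purely for illustration and is not a description of the clustering method used in Event Registry. The function names and the `threshold` value are assumptions, and cross-lingual matching is not covered.

```python
import re


def tokens(text):
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_articles(articles, threshold=0.3):
    """Assign each article to the most similar existing cluster,
    or start a new cluster if none reaches the threshold.

    Returns a list of clusters, each a list of article indices.
    """
    clusters = []  # each entry: {"tokens": set, "members": [indices]}
    for index, text in enumerate(articles):
        article_tokens = tokens(text)
        best, best_similarity = None, threshold
        for cluster in clusters:
            similarity = jaccard(article_tokens, cluster["tokens"])
            if similarity >= best_similarity:
                best, best_similarity = cluster, similarity
        if best is None:
            clusters.append({"tokens": set(article_tokens), "members": [index]})
        else:
            best["members"].append(index)
            best["tokens"] |= article_tokens  # grow the cluster vocabulary
    return [cluster["members"] for cluster in clusters]
```

A real pipeline would likely use weighted term vectors or concept annotations instead of raw word sets, but the assign-or-create loop stays the same.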
Equipped with data and analytics, we will try to infer information about important events related to each company, and develop tools to monitor the changing dynamics between companies in the European region.