Scraped data is being misused to scam people, but it has also helped monitor Covid-19 around the world and benefited researchers
By Kiran N. Kumar
Facebook owner Meta has sued Octopus, a US subsidiary of a Chinese tech company, and an individual from Turkey for scraping data from Facebook, Instagram and other major tech platforms, sparking debate over the ethics of the practice as well as its advantages.
With over one million customers, Octopus offers paid scraping services to extract data from Amazon, eBay, Twitter, Yelp, Google, Target, Walmart, Indeed, LinkedIn, Facebook and Instagram.
Turkey-based Ekrem Ates used automated Instagram accounts to scrape data from 350,000 Instagram users and post it to his own websites, or “clone sites”.
Read: With Big Brother watching, what’s your choice now? (May 16, 2022)
The software was able to collect Facebook users’ email addresses, phone numbers, gender and date of birth. On Instagram it collected follower data, including name, profile URL, location, and number of likes and comments per post.
Meta has been fighting the threat of cloning for a long time and has managed to reduce the number of Instagram clone sites from 100 to ten, as the scraped data is misused to scam people and damage the credibility of Meta’s original Facebook and Instagram sites.
In fact, data scraping has helped monitor Covid-19 around the world and has also benefited researchers in medicine, law, and even environmental protection.
Scraping data during the pandemic
As many will remember, Ensheng Dong and his team at Johns Hopkins University created a Covid-19 dashboard in January 2020, which has become the barometer for governments and scientists around the world.
Dong, a systems engineer at the university in Baltimore, Maryland, and his team found scraping useful for obtaining data from Wuhan in China, where the Covid-19 outbreak was first reported.
As the outbreak grew into a pandemic and the Covid-19 dashboard became the only authentic source, one that demanded a high degree of scalability, Dong and his team turned to web scraping to capture information from thousands of websites and record it in a spreadsheet without human intervention.
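The idea of capturing figures from a web page and recording them in a spreadsheet can be sketched in a few lines of Python. This is only an illustration, not the Johns Hopkins team's code: the HTML fragment, the region names and the counts are made-up placeholders, and a real scraper would first download each page. It uses only the standard library.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment with placeholder values (not real data).
SAMPLE_HTML = """
<table>
  <tr><th>Region</th><th>Confirmed</th></tr>
  <tr><td>Region A</td><td>120</td></tr>
  <tr><td>Region B</td><td>45</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Parse the table cells in `html` and return them as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


csv_text = table_to_csv(SAMPLE_HTML)
print(csv_text)
```

Run on a schedule against many pages, a routine like this replaces manual copying, though every source site still needs its own parsing rules.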
“For the first time in human history, we can follow in real time what is happening with a global pandemic,” Dong told Nature.
Scalable tool for researchers
Web scraping is not new. Alex Luscombe, a criminologist at the University of Toronto in Canada, uses scraping to monitor law enforcement practices in the country, while Phill Cassey, a conservation biologist at the University of Adelaide in Australia, uses it to monitor the global wildlife trade on internet forums.
Georgia Richards, an epidemiologist at the University of Oxford, UK, reviews coroners’ reports for preventable causes of death. “There are so many resources and so much information available online,” says Richards. “It’s just sitting there waiting for someone to come and use it.”
Today, scraping has evolved, with sophisticated tools commercially available from service providers such as Mozenda and ScrapeSimple, which charge $250 per month for scraping.
But many scholars still prefer open-source alternatives such as the Beautiful Soup package, Selenium and RSelenium, which they can customize to their own needs.
Web scraping has its own challenges
For example, Cassey found that illegal animal sales are a moving target. Forums hosting such transactions appear and disappear without warning, and the culprits use obscure and misleading names for plants and animals. For one parrot species in particular, the team said they found 28 “trade names”.
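One simple way to cope with many names for the same species is an alias table that maps every known trade name to one canonical label. The sketch below is an assumption about how that could look, not Cassey's method. The species label and aliases are invented placeholders, and plain substring matching like this can produce false positives.

```python
# Hypothetical alias table: placeholder labels, not real trade names.
ALIASES = {
    "species-x": ["blue talker", "sky bird", "b. talker"],
}


def build_lookup(aliases):
    """Map each lower-cased alias (and canonical label) to its canonical label."""
    lookup = {}
    for canonical, names in aliases.items():
        lookup[canonical.casefold()] = canonical
        for name in names:
            lookup[name.casefold()] = canonical
    return lookup


def canonical_species(listing_title, lookup):
    """Return the canonical label of a known alias found in the title, else None."""
    # Lower-case the title and collapse runs of whitespace before matching.
    text = " ".join(listing_title.casefold().split())
    # Try longer aliases first so the most specific name wins.
    for alias in sorted(lookup, key=len, reverse=True):
        if alias in text:
            return lookup[alias]
    return None


LOOKUP = build_lookup(ALIASES)
print(canonical_species("For sale: Sky  Bird, tame", LOOKUP))
print(canonical_species("Garden chairs, set of four", LOOKUP))
```

Listings that match no alias are the ones worth reviewing by hand, since they may contain a trade name not yet in the table.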
Chaowei Yang, a geospatial researcher at George Mason University in Fairfax, Va., cites another challenge: most data is locked in PDF documents and JPEG image files, which cannot be mined using conventional scraping tools.
Some data holders are unwilling to share their data. “I work against tons of powerful criminal justice agencies that really have no interest in me having data on the race of the people they arrest,” Luscombe says.
Researchers at the University Hospital of Saint-Étienne in France anonymized user IDs when scraping medical forums to identify adverse events associated with medications.
Read: Meta sues Chinese company’s US subsidiary for scraping Facebook and Instagram data (July 6, 2022)
But contextual cues can still reveal users’ identities, says Bissan Audeh, who helped develop the tool as a postdoctoral researcher in Bousquet’s lab. “No anonymization is perfect,” she says.
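A common way to hide user IDs in scraped posts is to replace each one with a keyed hash, so the same user always gets the same pseudonym but the original ID cannot be read back without the key. The sketch below is an assumption for illustration, not the Saint-Étienne team's tool. The key, user names and post texts are placeholders. As Audeh's warning implies, it does nothing about identifying details inside the text itself.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Return a stable 16-character pseudonym for a user ID via HMAC-SHA256."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Placeholder key: in practice, generate a random secret and store it
# separately from the data.
SECRET_KEY = b"replace-with-a-random-secret"

# Hypothetical scraped posts.
posts = [
    {"user": "forum_user_1", "text": "Felt dizzy after the second dose."},
    {"user": "forum_user_2", "text": "No side effects so far."},
    {"user": "forum_user_1", "text": "Dizziness gone after a week."},
]

anonymized = [
    {"user": pseudonymize(post["user"], SECRET_KEY), "text": post["text"]}
    for post in posts
]

for row in anonymized:
    print(row["user"], row["text"])
```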
However, respecting the rules of ethical scraping is considered good practice, even though it lengthens the process and may prove only as effective as collecting the data by hand.
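The article does not spell out which rules it means, but one widely used convention is a site's robots.txt file, which states which pages automated clients may fetch and how long to wait between requests. The sketch below checks such a policy with Python's standard `urllib.robotparser`. The robots.txt content, the `research-bot` agent name and the example.org URLs are assumptions made for illustration.

```python
import urllib.robotparser

# Hypothetical robots.txt content; a real scraper would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_policy(robots_txt):
    """Build a parser from robots.txt text without touching the network."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


AGENT = "research-bot"  # hypothetical user-agent name
policy = make_policy(ROBOTS_TXT)

allowed = policy.can_fetch(AGENT, "https://example.org/reports/2020.html")
blocked = policy.can_fetch(AGENT, "https://example.org/private/users.html")
delay = policy.crawl_delay(AGENT)  # seconds to wait between requests, if stated

print(allowed, blocked, delay)
```

Waiting the stated number of seconds between requests is part of what makes ethical scraping slower than an unrestricted crawl.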
Even the Johns Hopkins Covid dashboard team faced similar ethical issues, as the scraped data urgently needed fact-checking to verify its accuracy, necessitating an army of multilingual volunteers to decipher the Covid-19 reports from each country.