Month: March 2019

China makes unprecedented offers to US on tech transfer, trade

China wants the United States to lift its tariffs as part of a deal. Washington, which is cognizant that the tariffs give it leverage to ensure Beijing follows through on any commitments it makes, is wary of lifting them right away. Trump said last week the United States may leave tariffs on Chinese goods for a “substantial period” to ensure compliance. “Some tariffs will stay,” the second official said. “There’s going to be some give on that, but we’re not going to get rid of all the tariffs. We can’t.” The topic will be addressed in upcoming talks. “Obviously that

The post China makes unprecedented offers to US on tech transfer, trade first appeared on

China pledges to expand opening its financial market as the US trade delegation arrives

Li also sought on Thursday to ease investors’ concerns over China’s cooling economy, saying Beijing has enough policy tools to fight a “hard battle.” Li said China will cut “real interest rate levels” and lower financing costs for Chinese companies, but did not elaborate on which interest rate he was referring to. Li had made similar comments in a speech earlier this month. Positive changes in China’s economy in March exceeded the government’s expectations, he said. Some analysts say shockingly weak industrial profit data on Wednesday have added urgency for more policy easing. China’s industrial firms posted their worst slump

The post China pledges to expand opening its financial market as the US trade delegation arrives first appeared on

Optimism about economy dips but Americans still feel it is in good shape under Trump: CNBC survey

Photo: Michael Reynolds | Getty Images. US President Donald J. Trump listens to remarks from Prime Minister of Israel Benjamin Netanyahu (unseen), before signing an order recognizing Golan Heights as Israeli territory, in the Diplomatic Reception Room of the White House.

Despite rising concerns about growth this year, the CNBC All-America Economic Survey finds confidence in the economy holding up, dropping from the optimistic heights of last year but maintaining relatively strong levels.

The post Optimism about economy dips but Americans still feel it is in good shape under Trump: CNBC survey first appeared on

Trump moves toward China trade deal and USMCA after Mueller report

The high-stakes trade decisions in Washington do not end there. Trump has also accused Europe of unfair trade practices, and sees tariffs on European cars as one means to address them. The move would come with its own political risks. Trump’s tariff policy has sparked more backlash from Republicans on Capitol Hill than just about anything the president has done since he took office. GOP lawmakers have in particular questioned the national security justification the Trump administration used to put duties on steel and aluminum imports last year. A group of lawmakers from both major parties led by Sen. Pat

The post Trump moves toward China trade deal and USMCA after Mueller report first appeared on

Introduction to CRISP-DM and Data Preprocessing

Here’s a great interview question for a data scientist position. How much time do you allocate to data preprocessing vs modeling?

While building models is exciting, high performing models require high quality data as input. If we feed our model poor quality input data, then we obtain faulty output. As the saying goes Garbage In, Garbage Out. Even state-of-the-art machine learning methods will have poor performance if trained with the wrong data, because models learn exclusively from the training data (and not the data we wish they had). So we need to take our time to ensure that the training data is consistent and error-free.

The generally accepted answer to our interview question is 80% for data preprocessing and 20% for modeling. Improving the data upstream benefits every downstream step.

CRISP-DM (CRoss-Industry Standard Process for Data Mining) is the industry-standard process for a data science project. Data preprocessing is the third of the six phases in CRISP-DM. Here is an outline of the phases:

1. Business understanding involves understanding the objectives, creating a project plan and defining performance metrics.

2. Data understanding includes data collection and data exploration, as well as ensuring that we have high quality data.

3. Data preparation is what we call data preprocessing, and it is the topic of this blog post and its follow-up.

4. Modeling. Once data has been preprocessed in a suitable format for the machine learning task, it is used to train models and tune parameters. This phase can also include model assessment by a domain expert.

5. Evaluation reviews the model according to the business objectives. For instance, we might find that the model doesn’t cover some edge cases and that we need to collect additional data to investigate further.

6. Deployment presents the outcomes in a convenient and accessible format to the users.

We can group data preprocessing into two steps, data cleaning and feature engineering. The former transforms the raw data into consistent data and is done just once. The latter (which we’ll look at in the next blog post) transforms consistent data into a specific format for each machine learning method. So if we are going to apply five different methods, then we need to perform five different feature engineering pipelines.

We can use a toy dataset to explain data preprocessing concepts. The dataset contains information on customers of a hypothetical e-commerce website (HEW). The independent variables are the username (name), age, city, salary, number of visited pages (pages), number of unique sessions (sessions), number of visited products, and whether the member clicked on an advertisement for a specific product (click). The dependent variable is whether the member purchased the product currently on promotion (purchased).

The dataset is the following:

[Image: table of the HEW toy dataset]

Last year, the HEW marketing department showed the advertisement to all its customers, but this year it wants to target only those likely to buy that product. As data scientists, our job is to predict which customers are most likely to buy.

Throughout this post and the next, we preprocess this toy dataset for the predictive task.

Data cleaning

Edwin de Jonge and Mark van der Loo break data cleaning into three main stages:

• raw data, that is, our input data

• technically correct data, that is, raw data organized in a tabular format in which every value belongs to the domain of its variable. For our salary variable, “50000” (numeric) and “60000” (numeric) belong to the variable domain, whereas “high” (string) does not. The HEW toy dataset is already at this stage

• consistent data, that is, technically correct data transformed into a format suitable for machine learning.

This is the resulting data cleaning pipeline overview:

raw data → technically correct data → consistent data

The first two steps are highly dependent on the programming language we use. We may read data in a tabular format using `read.table` in R or `read_table` from the pandas library in Python. Then, we convert variables according to their type (for example, the values of the age and salary variables are converted to numerical values using the `as.numeric` function in R). We focus the remainder of this section on the third step, that is, building consistent data from technically correct data, and we briefly touch on two topics: handling missing values and handling inconsistencies.
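As a rough Python counterpart to the R functions mentioned above, the sketch below reads a small tab-separated table with `pandas.read_table` and converts two columns with `pandas.to_numeric`. The rows are invented for illustration and are not taken from the HEW dataset.

```python
import io

import pandas as pd

# Hypothetical raw input: made-up rows, not the HEW dataset.
raw = io.StringIO(
    "name\tage\tsalary\n"
    "user1\t25\t50000\n"
    "user2\t41\t60000\n"
    "user3\t37\thigh\n"
)

# Read every column as text first so that nothing is reinterpreted silently.
df = pd.read_table(raw, dtype=str)

# Convert to numbers; a value outside the numeric domain ("high") becomes NaN.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")

print(df)
```

With `errors="coerce"`, values that cannot be parsed are turned into missing values instead of raising an error, which makes them easy to find in the next step.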

Missing values

Real-world datasets often have several variables with missing values. A common default when fitting machine learning models is to remove any entry with missing values. Unfortunately, this can remove a lot of useful data. Understanding the nature of the missing data may help us keep most of the information.

There are four main reasons why a variable may have missing values:

• missing completely at random (MCAR), where every value has the same probability of being missing, so the missing salary values follow the same distribution as the observed ones.

• missing at random (MAR), where the missingness is related to the observed data, but not to the missing values themselves. That is, the salary value is missing because of the values in other recorded variables.

• missing that depends on unobserved predictors, where some values are missing because of information that was not recorded (unobserved).

• missing that depends on the missing value itself, where some values are more likely to be missing than others. For example, we don’t see salaries above 500000 in the data because no answer is given instead.

The last two categories are often grouped together under the name of missing not at random (MNAR).
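To see why the distinction matters, here is a small simulation sketch. It assumes a made-up lognormal salary distribution, a 30% MCAR missing rate and an MNAR cutoff at the 90th percentile; none of these numbers come from the HEW dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical salaries drawn from an assumed lognormal distribution.
salary = rng.lognormal(mean=10.8, sigma=0.5, size=10_000)

# MCAR: every value has the same 30% chance of being missing.
mcar_missing = rng.random(salary.size) < 0.3
mcar_observed = salary[~mcar_missing]

# MNAR: the highest salaries are never reported (assumed cutoff: 90th percentile).
cutoff = np.quantile(salary, 0.9)
mnar_observed = salary[salary <= cutoff]

print(f"true mean:          {salary.mean():.0f}")
print(f"mean under MCAR:    {mcar_observed.mean():.0f}")
print(f"mean under MNAR:    {mnar_observed.mean():.0f}")
```

Under MCAR the observed mean stays close to the true mean, whereas under this MNAR scenario the observed mean is biased downwards because the large values are the ones that are missing.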

Proper handling of missing values will be covered in a future blog post. For the moment we are going to remove user3 and user8 in the HEW dataset because they have one missing value each.
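Dropping incomplete rows can be sketched with pandas as follows. The table is a hypothetical stand-in with invented values, arranged so that user3 and user8 each have one missing value as described above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the HEW dataset; all values are invented.
df = pd.DataFrame(
    {
        "name": [f"user{i}" for i in range(1, 9)],
        "age": [25, 41, 30, 1, 29, 37, 33, np.nan],
        "salary": [50000, 60000, np.nan, 45000, 52000, 70000, 48000, 55000],
    }
)

# Count missing values per column before deciding what to do.
print(df.isna().sum())

# Drop every row that has at least one missing value.
cleaned = df.dropna().reset_index(drop=True)
print(cleaned["name"].tolist())
```

`dropna` with its default arguments removes a row as soon as any column is missing, which is simple but discards the other recorded values of that row.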

Inconsistencies

Some reported values may not belong to the variable domain, for example a “-2” (numeric) in the age variable. Other inconsistencies may be related to rules among variables, for example when someone’s age is 2 and their driving licence is Yes. In such a case there has been a mistake in recording the information, and the inconsistency must be addressed.

Understanding the meaning of the variables, with the help of a domain expert, helps tackle this issue, and a set of rules can then check for inconsistencies.

The HEW dataset has user4 with age set to 1. This is clearly a mistake in the data retrieval, and we decide to replace this value with the average age of the other users. As a result, user4 now has age 33.
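A set of rules can be expressed as boolean checks, one per rule. The sketch below uses invented rows, a hypothetical `licence` column, and assumed thresholds (ages between 0 and 120, and a minimum driving age of 16) that a domain expert would normally provide.

```python
import pandas as pd

# Invented rows; the "licence" column and all thresholds are assumptions.
df = pd.DataFrame(
    {
        "name": ["userA", "userB", "userC", "userD"],
        "age": [34, -2, 2, 45],
        "licence": ["Yes", "No", "Yes", "No"],
    }
)

# Each rule yields True where the row is consistent.
rules = pd.DataFrame(
    {
        # Domain rule: age must lie in a plausible range.
        "age_in_domain": df["age"].between(0, 120),
        # Cross-variable rule: a driving licence requires a minimum age.
        "licence_min_age": ~((df["licence"] == "Yes") & (df["age"] < 16)),
    }
)

# A row is flagged when it breaks at least one rule.
violations = df.loc[~rules.all(axis=1), "name"].tolist()
print(violations)
```

Keeping one column per rule makes it easy to report which rule each flagged row violates.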
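The correction can be sketched in pandas as below. The ages of the other users are invented so that their average is 33, and the plausibility threshold of 16 is an assumption made for this example only.

```python
import pandas as pd

# Hypothetical ages; only user4's age of 1 comes from the text above.
df = pd.DataFrame(
    {
        "name": ["user1", "user2", "user4", "user5", "user6", "user7"],
        "age": [25.0, 41.0, 1.0, 29.0, 37.0, 33.0],
    }
)

# Flag implausible ages (assumed threshold for an e-commerce customer).
suspicious = df["age"] < 16

# Average age of the remaining users, rounded to a whole number of years.
mean_age = df.loc[~suspicious, "age"].mean()
df.loc[suspicious, "age"] = round(mean_age)

print(df)
```

Computing the mean only over the rows that pass the check keeps the erroneous value from distorting its own replacement.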


We introduced data preprocessing in the context of the CRISP-DM cycle and delved into data cleaning to handle missing values and inconsistencies. What remains is feature engineering, which ensures that the data complies with the specific format required by the machine learning model of interest. We cover that in the next blog post.

The post Introduction to CRISP-DM and Data Preprocessing appeared first on RIIS.

Daimler targets autonomous truck market with stake in robotics firm

German automaker Daimler, the sales leader of Class 8 tractor-trailer semis, is taking a majority ownership stake in U.S. autonomous vehicle technology firm Torc Robotics, Daimler said Friday. Neither company will disclose how much Daimler plans to invest or what percentage of Torc the German company will own. “Torc takes a practical approach to commercialization and offers advanced, road-ready technology, plus years of experience in heavy vehicles,” said Roger Nielsen, CEO of Daimler Trucks North America, in a statement announcing the deal. Torc Robotics is one of several firms developing and road testing technology for autonomous trucks. Embark and self-driving

The post Daimler targets autonomous truck market with stake in robotics firm first appeared on

10 tax changes you need to know for 2019

The start of a new tax year is the perfect time to spring-clean your finances. Check out these tax changes to see whether you’ll be better or worse off in the year ahead.

From a state pension rise that will boost retirees’ incomes to a rise in tax for car owners, we list the tax changes that will make a difference to the money you have in your wallet from 6 April onwards.

1 You’re likely to pay less income tax

From 6 April, most people will pay less income tax. The tax-free personal allowance

The post 10 tax changes you need to know for 2019 first appeared on

Avocados recalled in bulk following reports of Listeria

Avocado retailer Henry Avocado is recalling California-grown avocados sold in bulk after routine testing showed samples contained Listeria. Both conventional and organic avocados are being recalled. The packages were shipped to Arizona, California, Florida, New Hampshire, North Carolina and Wisconsin. A spokesman for Henry Avocado did not immediately respond to a request for comment. The California plant did not begin packing avocados until January 2019, so all products are being recalled. Consumers can identify the products from the “Bravocado” sticker on the conventional avocados. The organic products have “organic” and “California” on the stickers rather than Bravocado. Henry Avocado also

The post Avocados recalled in bulk following reports of Listeria first appeared on