Python or Java for ETL?

by Allison Zhang, Data Engineer, Virtulytix

In my prior blog, we discussed the differences between Python and R for data scientists. But before data scientists can analyze the data, one more important process has to happen: ETL (Extract, Transform, and Load).


As always, the answer is it depends.


From a novice’s view, Python is easier than Java. If this is your first dive into the world of data or programming, Python can give you a quick introduction to key ETL concepts. Python is among the most readable languages: its simple syntax is approachable for everyone from experts to novice programmers, so you can focus on making the program produce your desired output. This simplicity has made Python an extremely popular language in both the business world and academia. A survey from the Association for Computing Machinery (ACM) found that Python has surpassed Java as the most popular language for introducing students to programming.

Once you understand basic programming concepts, it is time to consider other, more complex languages. This does not mean that Python is not an advanced programming language; on the contrary, Python can handle complex projects as well. But more factors need to be considered at this stage.

Python is flexible. It is dynamically typed, which means Python performs type checking at runtime, so you do not need to declare a variable’s type when creating it. Java, on the other hand, is statically typed: variables must be declared with a type before a value can be assigned. Python’s flexibility can save development time, but it can also surface type errors only at runtime, and the runtime type checking makes it slower than Java. Java is strict. Static typing makes it easier to provide autocompletion for Java, and the compiler prevents you from mixing different kinds of data together. This is very helpful in data engineering: when a program consists of hundreds of files, it is easy to get confused and make mistakes, and the more checks we have on our programs, the better off we are.
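The typing difference above can be seen in a few lines. Here is a minimal illustrative sketch (the variable and function names are invented for the example) showing how Python defers type checking to runtime:

```python
# Python is dynamically typed: a name can be rebound to any type,
# and type errors surface only when the offending line executes.
x = 42          # x currently refers to an int
x = "records"   # legal in Python; the equivalent Java,
                # `int x = 42; x = "records";`, fails at compile time

def add_one(value):
    return value + 1

print(add_one(10))      # works: 11
try:
    add_one("10")       # TypeError raised at runtime, not before
except TypeError as err:
    print("caught at runtime:", err)
```

Note that the bad call `add_one("10")` is only rejected when it actually runs; a Java compiler would have refused to build the program in the first place.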


When it comes to performance, Java is generally the better choice: its optimizations and virtual machine execution make it faster than Python, whose flexibility comes at a runtime cost. Java’s long history in the enterprise, along with its more verbose coding style, means that Java legacy systems are typically larger and more numerous than Python’s. Java’s “write once, run anywhere” design philosophy and its scalability have made it a top choice for enterprise-level development. For large-scale data, Java is usually the better option: many big-data platforms in the Apache ecosystem, such as Hadoop and Kafka, are written in Java or other JVM languages.

Ultimately, which language to choose depends on the scalability, performance, and purpose you want to achieve. For very large datasets, Java performs better than Python because of the factors discussed above. However, when performance is less critical, both languages are suitable for data engineers.

Is the ACLU right about facial recognition?

Facial recognition is back in the spotlight again thanks to a recent test of Amazon’s “Rekognition” product performed by the American Civil Liberties Union (ACLU).  Rekognition is a system, provided as a service by Amazon, used to identify a person or object based on an image or video (commonly known as ‘Facial Recognition’). This facial recognition technology is one of many new products to take advantage of new advancements in machine learning.

The goal of the ACLU’s test was to gauge the accuracy of the Rekognition product. To perform the test, a database was created from 25,000 publicly available arrest images. Headshots of all 535 members of Congress were then compared, using the Rekognition service, against the arrest image database. The result was that Rekognition produced 28 incorrect matches; in other words, according to Rekognition, 28 members of Congress had also been arrested or incarcerated (which is not true). Immediately afterward, the ACLU started sounding alarm bells about the accuracy of facial recognition and the ethical implications of its use by law enforcement. A roughly 5% error rate from any law enforcement tool is an understandable cause for concern. After all, these errors could add up to thousands of innocent Americans’ lives being ruined.

The problem is that while the ACLU’s concerns are valid, their testing of the Rekognition service is not. The Rekognition service takes a confidence-threshold parameter: the level of certainty the service must have before it reports a match. If the threshold is set to 50%, the service must have 50% or more confidence that two pictures contain the same individual or object before it counts them as a match. During the ACLU’s testing, the confidence threshold was set to 80%. This is a major flaw in the ACLU’s testing methodology: an 80% threshold is normally used for recognizing basic objects, such as a chair or a basketball, not a human face. Amazon’s own recommendation to law enforcement is to use a 95% confidence threshold when looking to identify individuals with a reasonable level of certainty.
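To see why the threshold matters so much, consider a toy sketch. The names and confidence scores below are invented for illustration, not real Rekognition output; the point is only how a cutoff filters candidate matches:

```python
# Hypothetical similarity scores from a face-matching service
# (illustrative numbers only, not real Rekognition output).
matches = [
    {"name": "Person A", "confidence": 99.1},
    {"name": "Person B", "confidence": 93.4},
    {"name": "Person C", "confidence": 85.0},
    {"name": "Person D", "confidence": 81.2},
]

def above_threshold(matches, threshold):
    """Keep only matches at or above the confidence threshold."""
    return [m for m in matches if m["confidence"] >= threshold]

print(len(above_threshold(matches, 80)))  # 4 candidate "matches"
print(len(above_threshold(matches, 95)))  # 1 -- the weak matches drop out
```

Raising the cutoff from 80 to 95 discards exactly the kind of low-confidence matches that inflated the ACLU’s false-positive count.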

The debate that the ACLU is trying to start regarding facial recognition and law enforcement is an important one. As law enforcement adopts new machine learning technologies, exercises such as these will be extremely important, but they must be done correctly. Had the ACLU used a 95% confidence threshold, this test would have been a much better exercise for determining the validity of facial recognition technology as a law enforcement tool. Another major improvement the ACLU could make is to increase the sample size: using the members of Congress does make for a compelling story, but a sample of 535 individuals is still small. Overall, the ACLU is demonizing a technology that it does not know how to use correctly. This will only lower people’s confidence in the ACLU, not in facial recognition technology.

Keeping a Close Eye on Your MPS program

Sales translate into profits for your company. However, in today’s cutthroat competitive environment, is that enough?

Managed Print Services (MPS) programs are set up to curb client costs while enabling smooth operation of the fleet, and they are paid for based on usage. Virtulytix uses advanced analytics to deliver the same benefits to MPS dealers.

What are dashboards?

Dashboards are information visualization tools that can be used to monitor, analyze and help in decision making. There are 3 types of dashboards:

1)     Operational dashboards

2)     Strategic dashboards

3)     Analytical dashboards

In this blog we will analyze operational dashboards in the imaging industry with the help of examples.

An operational dashboard is used to monitor the day-to-day business. These dashboards are viewed anywhere from every minute to a couple of times a day. In the imaging industry, dashboards would help:

1)     Supervisors to track service representatives out on duty

2)     Dealerships with manual toner fulfillment to make informed shipping decisions

3)     Warehouse manager to track demand and supply

What does Virtulytix bring to the table?

Aside from real-time dashboards to monitor the current transactions in the MPS environments, Virtulytix adds predictive and prescriptive insights to these dashboards to improve operational efficiency. Let us consider a couple of the above-mentioned use-cases.

1)     What if supervisors knew about a fuser failure that will occur tomorrow at the same site where a service representative is fixing a paper jam today? Predictive analytics would help curb costs.

NR 7.26 Blog pic 1.png

This dashboard is intended to provide supervisors with a high-level view of the operations for the day. At a glance the supervisor is aware of the current load on each of his service representatives, the locations that they will cover, the types of incidents and the volume. Predicted failures close to the service sites are visualized at the bottom right hand corner along with the probability of it occurring within the next 5 days.

2)     In the case of a dealership with manual toner fulfillment, the supplies department could know which cartridges are predicted to run empty in the next "n" days and ship replacements without the clients having to raise requests.


NR 7.26 Blog Pic 2.png

This dashboard lists the cartridges predicted to run empty in the next 10 days and compares the estimated days to empty against the days required for shipping. Cartridges that require immediate shipping are highlighted in red to avoid any further delay.
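The logic behind such a highlight rule is simple to sketch. A minimal illustrative version follows, with invented client names and numbers (a real dashboard would pull these from the fleet’s telemetry):

```python
# Hypothetical data: predicted days until a cartridge runs empty
# vs. shipping lead time to that client (names and numbers invented).
cartridges = [
    {"client": "Acme",    "days_to_empty": 3, "shipping_days": 5},
    {"client": "Globex",  "days_to_empty": 9, "shipping_days": 4},
    {"client": "Initech", "days_to_empty": 6, "shipping_days": 6},
]

def needs_immediate_shipping(c):
    # Flag (highlight in red) any cartridge that would run empty
    # on or before the day a shipment could arrive.
    return c["days_to_empty"] <= c["shipping_days"]

urgent = [c["client"] for c in cartridges if needs_immediate_shipping(c)]
print(urgent)  # ['Acme', 'Initech']
```

A prescriptive dashboard would go one step further and generate the shipping order automatically for the flagged clients.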

Because each MPS dealer has different requirements, operational dashboards can be customized to help MPS dealers keep a close eye on their assets and increase operational efficiency. Analytical insights can be visualized on these dashboards to solve problems or meet specific objectives. How to guide decision making with the help of strategic and analytical MPS dashboards will be covered in future blogs.

From Sensor to Solution: How to be Successful with Advanced Analytics

Advanced analytics is among the top buzzwords today due to the potential results this leading technology can bring. However, from statistics we’ve seen, efforts to develop and deploy these solutions fail as often as they succeed. Based on our experience, I’d like to review how we approach our customers and projects to give you some guidance to maximize your chance of success.

In any advanced analytics project, there are three main areas that have to be addressed. These are:

·      Business case

·      Solution architecture, technology and integration

·      Change management

For the first installment in this series, I will focus on business case. This should be the start of any new advanced analytics endeavor. One would think that all companies engage in an advanced analytics project with a clear business purpose in mind, but you’d be wrong. Too often, these projects are started with little focus leading to minimal results. With our customers, we begin with what we call a use case workshop. This typically involves a nominal fee to ensure the client has skin in the game. This workshop is designed to achieve multiple essential results including:

·      Identifying key pain points in the customer’s business

·      Identifying what data the customer has to support the project and what gaps exist

·      Identifying key client stakeholders and decision makers

·      Determining key use cases for advanced analytics

·      Assessing the financial impact of each use case

·      Developing a decision matrix to determine which use case to focus on first

·      Developing success criteria for the project

·      Determining the client’s comfort level with and ability to change

When we conduct these workshops, we work to ensure all of the major stakeholders and decision makers are in the room so that consensus can be built on the overall objective as well as the success criteria. It’s very important at an early stage to identify possible objections and work to overcome them. Additionally, this is a key opportunity to ensure that everyone is speaking the same language with regard to the solutions and technology. It’s also an opportunity to gauge the client’s ability to cope with significant process change. Value will not be delivered to the client unless they are able to leverage the solution created. If they are unwilling or unable to adjust their processes to account for the new insights delivered by the solution, they will never see the value and the project as a whole will be unsuccessful.

Once we’ve explored the pain points and identified a few use cases, typically 3-5, we then do a financial analysis in collaboration with the customer. We use the customer’s costs as well as the expected outcome to identify what net benefit the customer can expect. We then weigh this analysis against other variables such as feasibility of success, time to market, additional data sourcing and availability, and integration difficulty (both technical and process focused) to determine the best path forward. From there, we build out the proposal, which includes the key discoveries from the workshop (project objective, success criteria, schedule, roles and responsibilities, etc.) along with pricing and methodology. The key benefit of this methodology is that you have had the opportunity to uncover what’s most important to the client and build consensus with the key stakeholders prior to presenting a proposal. This prep work takes time, but it gives you a clear view of what is to be achieved, what the challenges are, and what the path to success looks like. Having this in place before you start developing models or cleaning data puts you on the right foot to maximize your chance of success. In the next installment we will discuss how to architect these solutions and the key pitfalls to look out for. Stay tuned!
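One way to make a decision matrix like the one described above concrete is a simple weighted score. This is only a sketch: the criteria, weights, use-case names, and scores below are invented for illustration, and a real workshop would derive them with the client:

```python
# A minimal weighted decision matrix (all values are illustrative).
# Each criterion gets a weight; each use case gets a 1-10 score per criterion.
weights = {"net_benefit": 0.4, "feasibility": 0.3,
           "time_to_market": 0.2, "data_availability": 0.1}

use_cases = {
    "Predictive maintenance": {"net_benefit": 8, "feasibility": 7,
                               "time_to_market": 6, "data_availability": 9},
    "Toner fulfillment":      {"net_benefit": 6, "feasibility": 9,
                               "time_to_market": 8, "data_availability": 8},
}

def weighted_score(scores, weights):
    """Sum of score * weight across all criteria."""
    return sum(scores[k] * w for k, w in weights.items())

ranked = sorted(use_cases,
                key=lambda u: weighted_score(use_cases[u], weights),
                reverse=True)
print(ranked[0])  # 'Toner fulfillment' edges out with these weights
```

Changing the weights (say, putting more emphasis on net benefit) can flip the ranking, which is exactly why agreeing on the weights with the stakeholders in the room matters.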

Advanced Analytics in the Horse Capital of the World

by Allison Traynor, Project Manager, Virtulytix

Here, in the Horse Capital of the World, you can find a perfect combination of advanced analytics and horse racing. Horse racing, the sport of kings, has become more popular in recent years, attracting more fans thanks in part to the improving economy and new efforts by the industry to reach out to up-and-coming fans. Those in this industry are people who love the sport and the magnificent horses that run the races. Recently at Keeneland Racetrack, there was a Welfare and Safety Summit for the horse racing industry on research and measures taken to ensure safety and integrity at the racetrack, along with additional common-sense measures. One of the most interesting presentations was given by Tim Parkin, a professor of veterinary epidemiology who is conducting research with advanced analytics. Parkin is attempting to create and deploy a model that predicts with a high level of accuracy which horse in which race will suffer a non-fatal or fatal injury.

I was deeply surprised to find how prevalent predictive analytics is within horse racing. Since 2009, fatal injuries during Thoroughbred races have declined 20%, but we can still do better. The Equine Injury Database (EID) gathered data from 3.1 million starts by nearly 150,000 horses between 2009 and 2017. However, as Parkin discussed, data is missing, to the point where nearly a quarter of the factors are deemed undetermined. To deploy the model in a production setting, it is important to identify what is missing: about 50% of non-fatal racing injuries go unrecorded, and data on training injuries will be needed as well. Within an industry that questions all possible scenarios, I wonder what else could have an effect. Could it be the horse’s determination on race day, or the consistency of the dirt? Could it be their physical build? A horse’s physical measurements are very important to how it races: the width and angle between its legs, the depth of its hindquarters, and the height of its withers. All of these matter when a horse runs, and lacking those features may make a horse more prone to injuries.

Predictive analytics is proving, in a short amount of time, how invaluable the technology is. By using technology and predictive algorithms, individuals and organizations can analyze their systems with advanced capabilities. As Dr. Tim Parkin presented, there are endless use cases for advanced analytics. At Virtulytix, we are discovering that the applications of advanced analytics are virtually limitless, here in the Horse Capital of the World.

What does the future hold for office printer manufacturers?

Most of you know that the printing market is declining and will continue to decline as printed pages are replaced by electronic documents. This has an impact not only on new printer shipments but, more importantly, on the printer installed base, which is what drives consumable and maintenance revenues. Why is this significant? Because consumables represent about 65% of total print revenue and an even higher percentage of profits for printer manufacturers, which is why printer manufacturers spend huge amounts of money in legal fees defending their printer cartridge patents.

Are You Operating At Only 85%?

Being a successful office products provider is getting tougher. Historically, if you did a reasonably good job of providing customer service, maintained a steady stable of good salespeople, and managed your expenses reasonably well, you could make really good money owning an office products dealership. Let's face it: all that printing and copying happening in offices provided you with a steady, annuity-based revenue stream. Life was good.